(54) Title: METHODS FOR OBTAINING AND USING HAPLOTYPE DATA

(57) Abstract: Methods, computer program(s) and database(s) to analyze and make use of gene haplotype information. These
include methods, program, and database to find and measure the frequency of haplotypes in the general population; methods, program,
and database to find correlations between an individual's haplotypes or genotypes and a clinical outcome; methods, program, and
database to predict an individual's haplotypes from the individual's genotype for a gene; and methods, program, and database to
predict an individual's clinical response to a treatment based on the individual's genotype or haplotype.

I. TITLE OF THE INVENTION

METHODS FOR OBTAINING AND USING HAPLOTYPE DATA 

II. RELATED APPLICATIONS

This application is a continuation-in-part of U.S. Application
Serial No. 60/141,521 filed June 25, 1999, which is incorporated by reference
herein.

III. FIELD OF THE INVENTION

The invention relates to the fields of genomics and genetics,
including genome analysis and the study of DNA variation. In particular, the
invention relates to the fields of pharmacogenetics and pharmacogenomics and
the use of genetic haplotype information to predict an individual's susceptibility to
disease and/or their response to a particular drug or drugs, so that drugs tailored to
genetic differences of population groups may be developed and/or administered to
the appropriate population.

The invention also relates to tools to analyze DNA, catalog
variations in DNA, study gene function and link variations in DNA to an
individual's susceptibility to a particular disease and/or response to a particular drug
or drugs.

The invention may also be used to link variations in DNA to
personal identity and racial or ethnic background.

The invention also relates to the use of haplotype information
in the veterinary and agricultural fields.

IV. BACKGROUND OF THE INVENTION

The accumulation of genomic information and technology is 
opening doors for the discovery of new diagnostics, preventive strategies, and drug 
therapies for a whole host of diseases, including diabetes, hypertension, heart 
disease, cancer, and mental illness. This is due to the fact that many human diseases
have genetic components, which may be evidenced by clustering in certain families,
and/or in certain racial, ethnic or ethnogeographic (world population) groups. For
example, prostate cancer clusters in some families. Furthermore, while prostate
cancer is common among all U.S. males, it is especially common among African
American men. They are 35 percent more likely than Americans of European
descent to develop the disease and more than twice as likely to die from it. A
variation on chromosome 1 (HPC1) and a variation on the X chromosome (HPCX)
appear to predispose men to prostate cancer and a study is currently underway to
test this hypothesis.

Likewise, it is clear that an individual's genes can have 
considerable influence over how that individual responds to a particular drug or 
drugs.

Individuals inherit specific versions of enzymes that affect
how they metabolize, absorb and excrete drugs. So far, researchers have identified
several dozen enzymes that vary in their activity throughout the population and that
probably dictate people's response to drugs - which may be good, bad or sometimes
deadly. For example, the cytochrome P450 family of enzymes (of which CYP 2D6
is a member) is involved in the metabolism of at least 20 percent of all commonly
prescribed drugs, including the antidepressant Prozac™, the painkiller codeine, and
high-blood-pressure medications such as captopril. Ethnic variation is also seen in
this instance. Due to genetic differences in cytochrome P450, for example, 6 to 10
percent of Whites, 5 percent of Blacks, and less than 1 percent of Asians are poor
drug metabolizers.

One very troubling observation is that adverse reactions often 
occur in patients receiving a standard dose of a particular drug. As an example, 
doctors in the 1950s would administer a drug called succinylcholine to induce 
muscle relaxation in patients before surgery. A number of patients, however, never
woke up from anesthesia - the compound paralyzed their breathing muscles and they
suffocated. It was later discovered that the patients who died had inherited a mutant
form of the enzyme that clears succinylcholine from their system. As another
example, as early as the 1940s doctors noticed that certain tuberculosis patients
treated with the antibacterial drug isoniazid would feel pain, tingling and weakness
in their limbs. These patients were unusually slow to clear the drug from their
bodies - isoniazid must be rapidly converted to a nontoxic form by an enzyme called
N-acetyltransferase. This difference in drug response was later discovered to be due
to differences in the gene encoding the enzyme. The number of people who would
experience adverse responses using this drug is not small. Forty to sixty per cent of
Caucasians have the less active form of the enzyme (i.e., "slow acetylators").

Another gene encodes a liver enzyme that causes side effects
in some patients who used Seldane™, an allergy drug which was removed from the
market. The drug Seldane™ is dangerous to people who have liver disease, are taking
antibiotics, or are using the antifungal drug Nizoral. The major problem with
Seldane™ is that it can cause serious, potentially fatal, heart rhythm disturbances
when more than the recommended dose is taken. The real danger is that it can
interact with certain other drugs to cause this problem at usual doses. It was
discovered that people with a particular version of a CYP450 suffered serious side
effects when they took Seldane™ with the antibiotic erythromycin.

Sometimes one ethnic group is affected more than others.
During the Second World War, for example, African-American soldiers given the
antimalarial drug primaquine developed a severe form of anaemia; the soldiers who
became ill had a deficiency in an enzyme called glucose-6-phosphate
dehydrogenase (G6PD) due to a genetic variation that occurs in about 10 per cent of
Africans, but very rarely in Caucasians. G6PD deficiency probably became more
common in Africans because it confers some protection against malaria.

Variations in certain genes can also determine whether a drug
treats a disease effectively. For example, a cholesterol-lowering drug called
pravastatin won't help people with high blood cholesterol if they have a common
gene variant for an enzyme called cholesteryl ester transfer protein (CETP). As
another example, several studies suggest that the version of the "ApoE" gene that is
associated with a high risk of developing Alzheimer's disease in old age (i.e.,
APOE4) correlates with a poor response to an Alzheimer's drug called tacrine. As
yet another example, the drug Herceptin™, a treatment for metastatic breast cancer,
only works for patients whose tumors overproduce a certain protein, called HER2.
A screening test is given to all potential patients to weed out those on whom the
drug won't be effective.

In summary, it is well known that not all individuals respond
identically to drugs for a given condition. Some people respond well to drug A but
poorly to drug B, some people respond better to drug B, while some have adverse
reactions to both drugs. In many cases it is currently difficult to tell how an
individual person will respond to a given drug, except by having them try using it.

It appears that a major reason people respond differently to a
drug is that they have different forms of one or more of the proteins that interact
with the drug or that lie in the cascade initiated by taking the drug.

A common method for determining the genetic differences
between individuals is to find Single Nucleotide Polymorphisms (SNPs), which may
be either in or near a gene on the chromosome, that differ between at least some
individuals in the population. A number of instances are known (Sickle Cell
Anemia is a prototypical example) for which the nucleotide at a SNP is correlated
with an individual's propensity to develop a disease. Often these SNPs are linked to
the causative gene, but are not themselves causative. These are often called
surrogate markers for the disease. The SNP/surrogate marker approach suffers from
at least three problems:

(1) Comprehensiveness: There are often several polymorphisms
in any given gene. (See Ref. 10 for an example in which there are 88 polymorphic
sites.) Most SNP projects look at a large number of SNPs, but spread over an
enormous region of the chromosome. Therefore the probability of finding all (or
any) SNPs in the coding region of a gene is small. The likelihood of finding the
causative SNP(s) (the subset of polymorphisms responsible for causing a particular
condition or change in response to a treatment) is even lower.

(2) Lack of Linkage: If the causative SNP is in so-called linkage
disequilibrium (Ref. 1, Chapter 2) with the measured SNP, then the nucleotide at the
measured SNP will be correlated with the nucleotide at the causative SNP.
However, it is impossible to predict a priori whether such linkage disequilibrium
will exist for a particular pair of measured and causative SNPs.

(3) Phasing: When there are multiple, interacting causative SNPs
in a gene, one needs to know the sequences of the two forms of the gene
present in an individual. For instance, assume there is a gene that has 3 causative
SNPs and that the remaining part of the gene is identical among all individuals. We
can then identify the two copies of the gene that any individual has with only the
nucleotides at those sites. Now assume that 4 forms exist in the population, labeled
TAA, ATA, TTA and AAA. SNP methods effectively measure SNPs one at a time,
and leave the "phasing" between nucleotides at different positions ambiguous. An
individual with one copy of TAA and one of ATA would have a genotype
(collection of SNPs) of [T/A, T/A, A/A]. This genotype is consistent with the
haplotypes TTA/AAA or TAA/ATA. An individual with one copy of TTA and one
of AAA would have exactly the same genotype as an individual with one copy of
TAA and one copy of ATA. By using unphased genotypes, we cannot distinguish
these two individuals.
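
The phasing ambiguity in the three-site example above can be reproduced with a short sketch. The function names `genotype_of` and `consistent_pairs` are illustrative only and are not part of the described programs; the sketch assumes each haplotype is written as a string with one nucleotide per polymorphic site.

```python
from itertools import product

def genotype_of(hap_a, hap_b):
    """Unphased genotype: the unordered pair of nucleotides at each site."""
    return [tuple(sorted(pair)) for pair in zip(hap_a, hap_b)]

def consistent_pairs(genotype):
    """Enumerate every unordered haplotype pair that yields the genotype."""
    pairs = set()
    # At each site, choose which of the two alleles sits on the first copy.
    for choice in product((0, 1), repeat=len(genotype)):
        first = "".join(site[c] for site, c in zip(genotype, choice))
        second = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

# The two individuals from the text.
g1 = genotype_of("TAA", "ATA")
g2 = genotype_of("TTA", "AAA")
same_genotype = (g1 == g2)          # both are [T/A, T/A, A/A]
candidates = consistent_pairs(g1)   # both haplotype pairs remain possible
```

Because the two individuals share one genotype while two haplotype pairs remain consistent with it, the genotype alone does not determine which pair is present.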

A relatively low density SNP based map of the genome will 
have little likelihood of specifically identifying drug target variations that will allow 
for distinguishing responders from poor responders, non-responders, or those likely 
to suffer side-effects (or toxicity) to drugs. A relatively low density SNP based map
of the genome also will have little likelihood of providing information for new 
genetically based drug design. In contrast, using the data and analytical tools of the 
present invention, knowing all the polymorphisms in the haplotypes will provide a 
firm basis for pursuing pharmacogenetics of a drug or class of drugs. 

With the present invention, by knowing which forms of the
proteins an individual possesses, in particular, by knowing that individual's
haplotypes (which are the most detailed description of their genetic makeup for the
genes of interest) for rationally chosen drug target genes, or genes intimately
involved with the pathway of interest, and by knowing the typical response for
people with those haplotypes, one can with confidence predict how that individual
will respond to a drug. Doing this has the practical benefit that the best available
drug and/or dose for a patient can be prescribed immediately rather than relying on a
trial and error approach to find the optimal drug. The end result is a reduction in
cost to the health care system. Repeat visits to the physician's office are reduced, the
prescription of needless drugs is avoided, and the number of adverse reactions is
decreased.

The Clinical Trials Solution (CTS™) method described herein
provides a process for finding correlations between haplotypes and response to
treatment and for developing protocols to test patients and predict their response to a
particular treatment.

The CTS™ method is partially embodied in the DecoGen™
Platform, which is a computer program coupled to a database used to display and
analyze genetic and clinical information. It includes novel graphical and
computational methods for treating haplotypes, genotypes, and clinical data in a
consistent and easy-to-interpret manner.



V. SUMMARY OF THE INVENTION 

The basis of the present invention is the fact that the specific
form of a protein and the expression pattern of that protein in a particular individual
are directly and unambiguously coded for by the individual's isogenes, which can
be used to determine haplotypes. These haplotypes are more informative than the
typically measured genotype, which retains a level of ambiguity about which form
of the proteins will be expressed in an individual. By having unambiguous
information about the forms of the protein causing the response to a treatment, one
has the ability to accurately predict individuals' responses to that treatment. Such
information can be used to predict drug efficacy and toxic side effects, lower the
cost and risk of clinical trials, redefine and/or expand the markets for approved
compounds (i.e., existing drugs), revive abandoned drugs, and help design more
effective medications by identifying haplotypes relevant to optimal therapeutic
responses. Such information can also be used, e.g., to determine the correct drug
dose to give a patient.

At the molecular level, there will be a direct correlation
between the form and expression level of a protein and its mode or degree of action.
By combining this unambiguous molecular level information (i.e., the haplotypes)
with clinical outcomes (e.g., the response to a particular drug), one can find
correlations between haplotypes and outcomes. These correlations can then be used
in a forward-looking mode to predict individuals' response to a drug.

The invention also relates to methods of making informative 
linkages between gene inheritance, disease susceptibility and how organisms react 
to drugs. 

The invention relates to methods and tools to individually 
design diagnostic tests, and therapeutic strategies for maintaining health, preventing 
disease, and improving treatment outcomes, in situations where subtle genetic 
differences may contribute to disease risk and response to particular therapies. 

The method and tools of the invention provide the ability to 
determine the frequency of each isogene, in particular, its haplotype, in the major 
ethno-geographic groups, as well as disease populations. 

Similarly, in agricultural biotechnology, the method and tools 
of the invention can be used to determine the frequency of isogenes responsible for 
specific desirable traits, e.g., drought tolerance and/or improved crop yields, and
reduce the time and effort needed to transfer desirable traits.

The invention includes methods, computer program(s) and 
database(s) to analyze and make use of gene haplotype information. These include 
methods, program, and database to find and measure the frequency of haplotypes in
the general population; methods, program, and database to find correlations
between an individual's haplotypes or genotypes and a clinical outcome; methods,
program, and database to predict an individual's haplotypes from the individual's 
genotype for a gene; and methods, program, and database to predict an individual's 
clinical response to a treatment based on the individual's genotype or haplotype. 

The invention also relates to methods of constructing a 
haplotype database for a population, comprising: 

(a) identifying individuals to include in the population; 
(b) determining haplotype data for each individual in the
population from isogene information;

(c) organizing the haplotype data for the individuals in 
the population into fields; and 

(d) storing the haplotype data for individuals in the 
population according to the fields. 
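
Steps (a) through (d) above can be illustrated with a minimal sketch using an in-memory SQLite table. The table name `haplotype_data`, its field names, the subject identifiers, and the gene label `GENE1` are all hypothetical choices made for this illustration; the text does not specify a schema at this point.

```python
import sqlite3

def build_haplotype_db(records):
    """Steps (c)-(d): organize haplotype data into fields and store it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE haplotype_data ("
        "subject_id TEXT, population_group TEXT, gene TEXT, "
        "haplotype_1 TEXT, haplotype_2 TEXT)"
    )
    conn.executemany(
        "INSERT INTO haplotype_data VALUES (?, ?, ?, ?, ?)", records
    )
    conn.commit()
    return conn

def haplotype_counts(conn, gene):
    """Count how often each haplotype occurs across both copies of a gene."""
    rows = conn.execute(
        "SELECT hap, COUNT(*) FROM ("
        " SELECT haplotype_1 AS hap FROM haplotype_data WHERE gene = ?"
        " UNION ALL"
        " SELECT haplotype_2 AS hap FROM haplotype_data WHERE gene = ?"
        ") GROUP BY hap ORDER BY hap",
        (gene, gene),
    ).fetchall()
    return dict(rows)

# Steps (a)-(b): hypothetical individuals with already-determined haplotypes.
records = [
    ("S1", "GroupA", "GENE1", "TAA", "ATA"),
    ("S2", "GroupA", "GENE1", "TTA", "AAA"),
    ("S3", "GroupB", "GENE1", "TAA", "TAA"),
]
conn = build_haplotype_db(records)
counts = haplotype_counts(conn, "GENE1")
```

Storing one row per individual and gene, with both haplotypes as separate fields, keeps the phased pair intact while still allowing population-level counts.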
The invention also relates to methods of predicting the
presence of a haplotype pair in an individual comprising, in order:

(a) identifying a genotype for the individual;

(b) enumerating all possible haplotype pairs which are
consistent with the genotype;

(c) accessing a database containing reference haplotype
pair frequency data to determine a probability, for
each of the possible haplotype pairs, that the
individual has a possible haplotype pair; and

(d) analyzing the determined probabilities to predict
haplotype pairs for the individual.
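
A minimal sketch of steps (a) through (d) follows. It assumes, for illustration only, that the reference database can be reduced to a dictionary of observed haplotype-pair counts and that the probability of each candidate pair is its count divided by the total count over all candidates; the counts shown are invented, and the actual probability calculation used by the described method may differ.

```python
from itertools import product

def enumerate_pairs(genotype):
    """Step (b): all unordered haplotype pairs consistent with the genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        first = "".join(site[c] for site, c in zip(genotype, choice))
        second = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

def predict_pair(genotype, reference_counts):
    """Steps (c)-(d): weight candidates by reference counts and pick the top."""
    candidates = enumerate_pairs(genotype)
    weights = {p: reference_counts.get(p, 0) for p in candidates}
    total = sum(weights.values())
    if total == 0:
        return None, {}  # no reference data supports any candidate
    probabilities = {p: w / total for p, w in weights.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities

# Step (a): the genotype [T/A, T/A, A/A]; the counts are hypothetical.
genotype = [("T", "A"), ("T", "A"), ("A", "A")]
reference_counts = {("AAA", "TTA"): 2, ("ATA", "TAA"): 6}
best_pair, probabilities = predict_pair(genotype, reference_counts)
```

With these invented counts the pair TAA/ATA is three times as likely as TTA/AAA, so it would be the predicted pair.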

The invention also relates to methods for identifying a
correlation between a haplotype pair and a clinical response to a treatment
comprising:

(a) accessing a database containing data on clinical
responses to treatments exhibited by a clinical
population;

(b) selecting a candidate locus hypothesized to be
associated with the clinical response, the locus
comprising at least two polymorphic sites;

(c) generating haplotype data for each member of the
clinical population, the haplotype data comprising
information on a plurality of polymorphic sites
present in the candidate locus;

(d) storing the haplotype data; and

(e) identifying the correlation by analyzing the haplotype
and clinical response data.
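
As one hedged illustration of step (e), the sketch below groups a clinical measurement by haplotype pair and reports the mean, standard deviation, and group size. The subject data are invented, and the function name `summarize_by_haplotype_pair` is hypothetical; a formal test of significance would be a further step that this sketch does not perform.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_by_haplotype_pair(subjects):
    """Group clinical responses by unordered haplotype pair and summarize."""
    groups = defaultdict(list)
    for pair, response in subjects:
        groups[tuple(sorted(pair))].append(response)
    summary = {}
    for pair, values in groups.items():
        # Standard deviation needs at least two observations.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[pair] = (mean(values), spread, len(values))
    return summary

# Hypothetical (haplotype pair, measured response) records.
subjects = [
    (("TAA", "ATA"), 10.0),
    (("ATA", "TAA"), 14.0),
    (("TTA", "AAA"), 30.0),
    (("AAA", "TTA"), 34.0),
]
summary = summarize_by_haplotype_pair(subjects)
```

Sorting each pair before grouping treats TAA/ATA and ATA/TAA as the same haplotype pair, which is why the four records fall into two groups.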

The invention also relates to methods for identifying a
correlation between a haplotype pair and susceptibility to a disease comprising the
steps of:

(a) selecting a candidate locus hypothesized to be
associated with the condition or disease, the locus
comprising at least two polymorphic sites;

(b) generating haplotype data for the candidate locus for
each member of a disease population;

(c) organizing the haplotype data in a database;

(d) accessing a database containing reference haplotypes
for the candidate locus;

(e) identifying the correlation by analyzing the disease
haplotype data and the reference haplotype data,
wherein when a haplotype pair has a higher frequency
in the disease population than in the reference
population, a correlation of the haplotype pair to a
susceptibility to the disease is identified.
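
The frequency comparison in step (e) can be sketched as follows. The populations below are invented, the function names are hypothetical, and the sketch applies only the "higher frequency" rule stated above; in practice one would presumably also ask whether the difference is statistically meaningful, which is an assumption beyond what this passage states.

```python
from collections import Counter

def pair_frequencies(pairs):
    """Relative frequency of each unordered haplotype pair in a population."""
    counts = Counter(tuple(sorted(p)) for p in pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def enriched_pairs(disease_pairs, reference_pairs):
    """Pairs more frequent in the disease population than in the reference."""
    disease = pair_frequencies(disease_pairs)
    reference = pair_frequencies(reference_pairs)
    return {
        pair: (freq, reference.get(pair, 0.0))
        for pair, freq in disease.items()
        if freq > reference.get(pair, 0.0)
    }

# Hypothetical populations of four individuals each.
disease_population = [("TAA", "ATA")] * 3 + [("TTA", "AAA")]
reference_population = [("TAA", "ATA")] + [("TTA", "AAA")] * 3
flagged = enriched_pairs(disease_population, reference_population)
```

Each flagged pair maps to its disease-population frequency and its reference frequency, so the size of the difference stays visible.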

The invention also relates to methods of predicting response 
to a treatment comprising: 

(a) selecting at least one candidate gene which exhibits a 
correlation between haplotype content and at least two 
different responses to the treatment; 

(b) determining a haplotype pair of an individual for the 
candidate gene; 

(c) comparing the individual's haplotype pair with stored 
information on the correlation; and 

(d) predicting the individual's response as a result of the 
comparing. 
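
Steps (b) through (d) amount to a lookup against stored correlation information. The sketch below assumes, purely for illustration, that the stored information is a table mapping each haplotype pair to a response label; the table contents and the function name `predict_response` are hypothetical.

```python
def predict_response(haplotype_pair, correlation_table, default="unknown"):
    """Compare an individual's haplotype pair with stored correlations."""
    key = tuple(sorted(haplotype_pair))  # pairs are unordered
    return correlation_table.get(key, default)

# Hypothetical stored correlations for one candidate gene.
correlation_table = {
    ("ATA", "TAA"): "responder",
    ("AAA", "TTA"): "non-responder",
}
prediction = predict_response(("TAA", "ATA"), correlation_table)
unseen = predict_response(("TAA", "TAA"), correlation_table)
```

Returning a default for pairs absent from the table keeps the sketch from predicting a response where no correlation has been stored.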

The invention also provides computer systems which are
programmed with program code which causes the computer to carry out many of the
methods of the invention. A range of computer types may be employed; suitable
computer systems include but are not limited to computers dedicated to the methods
of the invention, and general-purpose programmable computers. The invention
further provides computer-usable media having computer-readable program code
stored thereon, for causing a computer to carry out many of the methods of the
invention. Computer-usable media include, but are not limited to, solid-state
memory chips, magnetic tapes, or magnetic or optical disks. The invention also
provides database structures which are adapted for use with the computers, program
code, and methods of the invention.

VI. BRIEF DESCRIPTION OF THE DRAWINGS

FIGURE 1. System Architecture Schematic. 

FIGURE 2. Pathway/Gene Collection View. This screen 
shows a schematic of candidate genes from which a candidate gene may be selected 
to obtain further information. A menu on the left of the screen indicates some of the 
information about the candidate genes which may be accessed from a database.
The candidate genes shown are:

TNFR1 - Tissue Necrosis Factor 1
ADRB2 - Beta-2 Adrenergic Receptor
IGERA - immunoglobulin E receptor alpha chain
IGERB - immunoglobulin E receptor beta chain
OCIF - osteoclastogenesis inhibitory factor
ERA - Estrogen alpha receptor
IL-4R - interleukin 4 receptor
5HT1A - 5-hydroxytryptamine receptor 1A
DRD2 - dopamine receptor D2
TNFA - tumor necrosis factor alpha
IL-1B - interleukin 1B
PTGS2 - prostaglandin synthase 2 (COX-2)
IL-4 - interleukin 4
IL-13 - interleukin 13
CYP2D6 - cytochrome P450 2D6
HSERT - serotonin transporter
UCP3 - uncoupling protein 3



FIGURE 3. Gene Description View. This screen provides
some of the basic information about the currently selected gene.

FIGURE 4A. Gene Structure View. This screen shows the
location of features in the gene (such as promoter, introns, exons, etc.), the location
of polymorphic sites in the gene for each haplotype and the number of times each
haplotype was seen in various world population groups.

FIGURE 4B. Gene Structure View (Cont.). This screen
shows the display which results after a gene feature is selected in the screen of
FIGURE 4A. An expanded view of the selected gene feature is shown at the bottom
of the screen.

FIGURE 5. Sequence Alignment View. This screen shows 
an alignment of the full DNA sequences for all the haplotypes (i.e., the isogenes) 
which appears in a separate window when one of the features in FIGURE 4A or 4B 
is selected. The polymorphic positions are highlighted. 

FIGURE 6. mRNA Structure View. This screen shows the
secondary structure of the RNA transcript for each isogene of the selected gene.

FIGURE 7. Protein Structure View. This screen shows
important motifs in the protein. The location of polymorphic sites in the protein is
indicated by triangles. Selecting a triangle brings up information about the selected
polymorphism at the top of the screen.

FIGURE 8. Population View. This screen shows information 
about each of the members of the population being analyzed. PID is a unique 
identifier. 

FIGURE 9. SNP Distribution View. This screen shows the
genotype to haplotype resolution of each of the individuals in the population being
examined.

FIGURE 10. Haplotype Frequencies (Summary View). This
screen shows a summary of ethnic distribution as a function of haplotypes.

FIGURE 11. Haplotype Frequencies (Detailed View). This
screen shows details of ethnic distribution as a function of haplotype. Numerical
data is provided.

FIGURE 12. Polymorphic Position Linkage View. This 
screen shows linkage between polymorphic sites in the population. 

FIGURE 13. Genotype Analysis View (Summary View).
This screen shows haplotyping identification reliability using genotyping at selected
positions.

FIGURE 14. Genotype Analysis View (Detailed View). This
screen gives a number value for the graphical data presented in FIGURE 13.

FIGURE 15. Genotype Analysis View (Optimization View).
This screen gives the results of a simple optimization approach to finding the
simplest genotyping approach for predicting an individual's haplotypes.

FIGURES 16 and 17. Haplotype Phylogenetic Views. These 
screens show minimal spanning networks for the haplotypes seen in the population. 

FIGURE 18. Clinical Measurements vs. Haplotype View
(Summary). This screen shows a matrix summarizing the correlation between
clinical measurements and haplotypes.

FIGURE 19. Clinical Measurements vs. Haplotype View
(Distribution View). This screen shows the distribution of the patients in each cell
of the matrix of FIGURE 18.

FIGURE 20. Expanded view of one haplotype-pair
distribution. This screen results when a user selects a cell in the matrix in FIGURE
19. The screen shows the number of patients in the various response bins indicated
on the horizontal axis.

FIGURE 21. Linear Regression Analysis View. This screen 
shows the results of a dose-response linear regression calculation on each of the 
individual polymorphisms.

FIGURE 22. Clinical Measurements vs. Haplotype View 
(Details). This screen gives the mean and standard deviation for each of the cells in 
FIGURE 18. 

FIGURE 23. Clinical Measurement ANOVA calculation.
This screen shows the statistical significance between haplotype pair groups and
clinical response.

FIGURE 24. Interface to the DecoGen CTS Modeler. As
described in the text, a genetic algorithm (GA) is used to find an optimal set of
weights to fit a function of the subject haplotype data to the clinical response. The
controls at the right of the page are used to set the number of GA generations, the
size of the population of "agents" that coevolve during the GA simulation, and the
GA mutation and crossover rates. The GA population and its parameters should
not be confused with those of the real human subjects; these are simply
terms used in the computational algorithm which is the GA. The GA is an
error-minimizing approach, where the error is a weighted sum of differences between the
predicted clinical response and that which is measured. The graph in the top-middle
shows the residual error as a function of computational time, measured in
generations. The bar graph at the bottom center shows the weights from Equation 6
for the best solution found so far in the GA simulation.
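
The kind of genetic algorithm described for FIGURE 24 can be sketched as below. Several points are assumptions made only for this illustration: Equation 6 is not reproduced in this section, so the sketch fits a simple linear function of per-site variant counts; the error uses equal weights on the absolute differences; and the selection, crossover, and mutation schemes, along with all parameter defaults and subject data, are invented.

```python
import random

def predict(weights, features):
    """Assumed model: weighted sum of per-site variant-allele counts."""
    return sum(w * x for w, x in zip(weights, features))

def fit_error(weights, subjects):
    """Sum of absolute differences between predicted and measured response."""
    return sum(abs(predict(weights, x) - y) for x, y in subjects)

def run_ga(subjects, n_weights, generations=200, pop_size=40,
           mutation_rate=0.2, crossover_rate=0.7, seed=0):
    rng = random.Random(seed)
    # Each "agent" in the GA population is one candidate weight vector.
    population = [[rng.uniform(-5, 5) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    history = []  # best residual error per generation
    for _ in range(generations):
        population.sort(key=lambda w: fit_error(w, subjects))
        history.append(fit_error(population[0], subjects))
        survivors = population[: pop_size // 2]  # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            if rng.random() < crossover_rate:
                # Uniform crossover: take each weight from either parent.
                child = [rng.choice(pair) for pair in zip(a, b)]
            else:
                child = list(a)
            # Mutation: occasionally perturb a weight with Gaussian noise.
            child = [w + rng.gauss(0, 0.5) if rng.random() < mutation_rate
                     else w for w in child]
            children.append(child)
        population = survivors + children
    best = min(population, key=lambda w: fit_error(w, subjects))
    return best, history

# Hypothetical subjects: (variant counts at three sites, measured response).
subjects = [((0, 1, 2), 5.0), ((1, 1, 0), 3.0),
            ((2, 0, 1), 4.0), ((0, 0, 1), 1.5)]
best_weights, error_history = run_ga(subjects, n_weights=3)
```

Because the better half of the agents is carried over unchanged, the best residual error never increases from one generation to the next, which corresponds to the kind of error-versus-generation curve described for the top-middle graph.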

FIGURE 25A. Gene Repository data submodel.

FIGURE 25B. Population Repository data submodel.

FIGURE 25C. Polymorphism Repository data submodel.

FIGURE 25D. Sequence Repository data submodel.

FIGURE 25E. Assay Repository data submodel.

FIGURE 25F. Legend of symbols in FIGURES 25A-E.

FIGURE 26. Pathway View. This screen shows a schematic
of candidate genes relevant to asthma from which a candidate gene may be selected
to obtain further information. This view is an alternative way of showing
information similar to that described in the Pathway/Gene Collection View shown
in FIGURE 2, with access to additional views, projects and other information, as
well as additional tools. A menu on the left of the screen in FIGURE 26 indicates
some of the information about the candidate genes which may be accessed from a
database. The candidate genes shown are:

ADRB2 - Beta-2 Adrenergic Receptor
IL-9 - Interleukin 9
PDE6B - Phosphodiesterase 6B
CALM1 - Calmodulin 1
JAK3 - Janus Tyrosine Kinase 3

The following is a description of what happens (or could
be made to happen) when each of the items on top of the screens (e.g., "File",
"Edit", "Subsets", "Action", "Tools", "Help") is selected:

• File:
New
Open
Save
Save As
Exit

"File" lets the viewer open or save a
project file, which contains a list of genes to be viewed.

• Edit: 
Cut 
Copy 
Paste 

• Subsets:

"Subsets" allows the user to create and select for analysis
subsets of the total patient set. Once a subset has been defined and named, the name
of the subset goes into the pulldown under this menu. Functions are available to
select a subset of patients based on clinical value ("Select everyone with a
cholesterol level > 200"), or ethnicity, or genetic makeup ("Select all patients with
haplotype CAGGCTGG for gene DAXX"), etc.

• Action:
Redo

"Redo" will cause displays to be regenerated when, for
instance, the active set of SNPs has been changed.

• Tools:

"Tools" will bring up various utilities, such as a statistics
calculator for calculating etc.

• Help:

"Help" will bring up on-line help for various functions.
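
The subset selection described under the "Subsets" menu above can be sketched as a named filter over patient records. The record layout, patient identifiers, and the helper `select_subset` are hypothetical; only the two selection criteria (a cholesterol level above 200, and haplotype CAGGCTGG for gene DAXX) come from the text.

```python
def select_subset(patients, predicate):
    """Return the patients for which the selection criterion holds."""
    return [p for p in patients if predicate(p)]

# Hypothetical patient records.
patients = [
    {"id": "P1", "cholesterol": 240,
     "haplotypes": {"DAXX": ("CAGGCTGG", "CAGGCTGA")}},
    {"id": "P2", "cholesterol": 180,
     "haplotypes": {"DAXX": ("CAGGCTGG", "CAGGCTGG")}},
    {"id": "P3", "cholesterol": 215,
     "haplotypes": {"DAXX": ("TAGGCTGA", "CAGGCTGA")}},
]

# Named subsets, as would appear in the pulldown under the menu.
named_subsets = {
    "cholesterol > 200": select_subset(
        patients, lambda p: p["cholesterol"] > 200),
    "DAXX CAGGCTGG": select_subset(
        patients, lambda p: "CAGGCTGG" in p["haplotypes"].get("DAXX", ())),
}
subset_ids = {name: [p["id"] for p in members]
              for name, members in named_subsets.items()}
```

Expressing each criterion as a predicate lets clinical, ethnic, and genetic selections share one mechanism.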



The following is a description of the Standard Buttons that
occur on all screens:

• New (blank sheet) - standard windows button for creating new
file - this creates a new project

• Open (open folder) - standard windows button for opening
existing file - open an existing project

• Save (picture of floppy disk) - save the current project to a file

• Save 2nd version - save the currently selected set of individuals or
genes to a collection that can be separately analyzed.

• Print (picture of printer) - print the current page

• Cut (scissors) - delete the selected items (could be a gene or
genes, a person, a SNP, etc., depending on the context)

• Copy - copy the selected item (as above) to the clipboard

• Paste - paste the contents of the clipboard to the current view

• X - currently not used

• New 2 (next blank page icon) - create a subset (genes, people,
etc.) from the selected items in the view

• Recalculate (icon of calculator) - redo computation of statistics,
etc., depending on the context.

• Help (question mark) - bring up on-line help for the current view.

The following is a description of Buttons that show up on
several views:

• Expand (magnifying glass with + sign) - zoom in on the
graphical display - increase in size

• Shrink (magnifying glass with - sign) - zoom out on the
graphical display - decrease in size

FIGURE 27. GeneInfo View. This screen provides some of
the basic information about the currently selected ADRB2 gene. This screen is an
alternative way of showing information similar to that described in the Gene
Description View in FIGURE 3.

FIGURE 28A. GeneStructure View. This screen shows the
location of features in the gene (such as promoter, introns, exons, etc.), the location
of polymorphic sites in the gene for each haplotype and the number of times each
haplotype was seen in various world population groups for the ADRB2 gene. This
screen is an alternative way of showing information similar to that described in the
Gene Structure View in FIGURE 4A.

FIGURE 28B. GeneStructure View (Cont.). This screen
shows the display which results after a gene feature is selected in the screen of
FIGURE 28A. This screen is an alternative way of showing information similar to
that described in the Gene Structure View in FIGURE 4B. An expanded view of the
nucleotide sequence flanking the selected polymorphic site is shown at the top of
the screen. This portion of the screen provides access to some of the same
information as shown in FIGURE 5 (Sequence Alignment View).

FIGURE 29A. Patient Table View/Patient Cohort View. This 
screen shows genotype and haplotype information about each of the members of the 
patient population being analyzed. Family relationships are also shown, when such 
information is present. Families 1333 and 1047 shown in FIGURE 29A are the 
families that were analyzed for this gene. In this particular screen, if other families 
had been analyzed, they would appear below those shown, where one 
would scroll down. "Subject" is a unique identifier. The patients' genotypes are 
shown in the top right panel. At the far left of this panel (not seen until one scrolls 
over) are the indices for the two haplotypes that a patient has. These indices refer to 
the haplotype table at the bottom right. The left hand panel shows the haplotype IDs 
for families that have been analyzed as part of a cohort. The haplotypes must follow 
a Mendelian inheritance pattern, i.e., an individual receives one copy from the 
mother and one from the father. For instance, if an individual's mother had 
haplotypes 1 and 2 and the father had haplotypes 3 and 4, then that individual must 
have one of the following pairs: (1,3), (1,4), (2,3) or (2,4). This panel is used to 
check the accuracy of the haplotype determination method used. 
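
The Mendelian consistency check described for FIGURE 29A can be illustrated with a short sketch. This is only a minimal illustration, not the DecoGen implementation; the function names and the representation of a haplotype pair as a tuple of haplotype indices are assumptions made for this example.

```python
from itertools import product


def allowed_child_pairs(mother, father):
    """Return every unordered haplotype pair a child could inherit.

    `mother` and `father` are tuples of two haplotype indices each
    (hypothetical representation chosen for this sketch).
    """
    # One haplotype comes from the mother and one from the father.
    return {tuple(sorted(pair)) for pair in product(mother, father)}


def is_mendelian_consistent(child, mother, father):
    """True if the child's haplotype pair can be explained by the parents."""
    return tuple(sorted(child)) in allowed_child_pairs(mother, father)
```

With a mother carrying haplotypes 1 and 2 and a father carrying haplotypes 3 and 4, `allowed_child_pairs((1, 2), (3, 4))` yields the four pairs listed above, and a child assigned the pair (1, 2) would be flagged as inconsistent.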

FIGURE 29B. Clinical Trial Data View. This screen 
gives the values of all of the clinical measurements for each individual in FIGURE 
29A. 

FIGURE 30. HAPSNP View. This screen shows the 
genotype to haplotype resolution of the ADRB2 gene for each of the individuals in 
the population being examined. This view provides similar information as that 
shown in the SNP Distribution View of FIGURE 9. 

FIGURE 31. HAPPair View. This screen shows a summary 
of ethnic distribution of haplotypes of the ADRB2 gene. This view is an alternative 
way of showing information similar to that shown in the Haplotype Frequencies 
(Summary View) of FIGURE 10. The "V/D" (i.e., View Details) button in this 
view allows the user to toggle between the views shown in FIGURES 31 and 32. 

FIGURE 32. HAP Pair View (HAP Pair Frequency View). 
This screen shows details of ethnic distribution as a function of haplotypes of the 
ADRB2 gene. Numerical data is provided. This view is an alternative way of 
showing information similar to that shown in the Haplotype Frequencies (Detailed 
View) of FIGURE 11 for the CYP2D6 gene. The V/D button has the same function 
as in FIGURE 31. 

FIGURE 33. Linkage View. This screen shows linkage 
between polymorphic sites in the population for the ADRB2 gene. This view is an 
alternative way of showing information similar to that shown in FIGURE 12 for the 
CYP2D6 gene. 

FIGURE 34. HAPTyping View. This screen shows the 
reliability of haplotype identification using genotyping at selected positions for 
the ADRB2 gene. This view is an alternative way of showing information similar to 
that shown in the Genotype Analysis Views of FIGURES 13, 14 and 15 for the 
CYP2D6 gene. This view is the interface to the automated method for determining 
the minimal number of SNPs that must be examined in order to determine the 
haplotypes for a population. See "Step 6", Section D(1) and Example 2, herein, for 
details of this method. The view shows all pairs of haplotypes and their 
corresponding genotypes and finally the frequency of the genotype. The inset 
(which one sees by scrolling to the right) shows the best scoring set of SNPs to 
score, along with a quality score (scores < 1 are acceptable). The pairs of numbers in 
brackets are the genotypes that are still indistinguishable given this SNP set. 
"Population" in the box in the top of the figure is equivalent to the "Subset" 
selection menu described above. Populations and subsets are the same. One subset 
is the total analyzed population. 
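
The idea of finding which haplotype pairs remain indistinguishable for a given SNP set can be sketched as follows. This is a brute-force illustration under stated assumptions, not the automated method of the invention: haplotypes are represented as equal-length strings with one character per polymorphic site, the function names are hypothetical, and the quality score and genotype frequencies mentioned above are not modeled.

```python
from itertools import combinations, combinations_with_replacement


def genotype(hap_a, hap_b, sites):
    """Unphased genotype of a haplotype pair, restricted to the given sites."""
    return tuple(tuple(sorted((hap_a[s], hap_b[s]))) for s in sites)


def ambiguous_pairs(haplotypes, sites):
    """Group haplotype pairs (1-based indices) sharing one genotype at `sites`."""
    groups = {}
    indices = range(len(haplotypes))
    for i, j in combinations_with_replacement(indices, 2):
        key = genotype(haplotypes[i], haplotypes[j], sites)
        groups.setdefault(key, []).append((i + 1, j + 1))
    # Only groups with more than one pair are indistinguishable.
    return [pairs for pairs in groups.values() if len(pairs) > 1]


def smallest_resolving_set(haplotypes):
    """Smallest set of site positions that separates every haplotype pair.

    Returns None when even the full set of sites leaves some pairs ambiguous.
    """
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            if not ambiguous_pairs(haplotypes, sites):
                return sites
    return None
```

The exhaustive search grows exponentially with the number of sites, so it is suitable only for small illustrations; a practical method would need a scoring scheme that tolerates some residual ambiguity, as the quality score in the view suggests.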

FIGURE 35. Phylogenetic View. These screens show 
minimal spanning networks for the haplotypes seen in the population for the 
ADRB2 gene. This view is an alternative way of showing information similar to 
that shown in FIGURES 16 and 17 for the CYP2D6 gene. This view also provides 
a window containing haplotype and ethnic distribution information. The numbers 
next to the balls represent the haplotype number and the numbers inside the 
parentheses represent the number of people in the analyzed population that have that 
haplotype. The function of the calculator button (or a red/green flag button, not 
shown in this view) is the same as "Recalculate" in FIGURES 16 and 17. In this case 
it arranges nodes according to evolutionary distance. 

FIGURE 36. Clinical Haplotype Correlations View 
(Summary). This screen shows a matrix summarizing the correlation between 
clinical measurements and haplotypes for the ADRB2 gene. This view is an 
alternative way of showing information similar to that shown in FIGURE 18 for the 
CYP2D6 gene. 

Buttons are as described for FIGURE 26 and as follows: 

• Graph (icon of graph) - does a statistics calculation and brings up 
a statistics results window, such as FIGURE 39A. 

• Normal (icon of bell curve) - does a HAPpair ANOVA 
calculation - a specialized statistical calculation. 

• 3 finger down icon - displays a graph showing a histogram of 
clinical data for individuals with specific genetic markers. 

• Thermometer - shows a list of clinical variables for the user to 
select from for display and analysis. 

Some of the viewing modes obtainable by selecting the following 
drop-down menus on this view (and the other views on which they appear) are: 

• Scaling: 
Linear 
Log 
Log 10 

• Clinical Mode: 
Summary 
Distribution 
Details 
Quantile 

• Statistic: 
Regression 
ANOVA 
Case Control 
ANCOVA 
Response Model 

FIGURE 37. Clinical Measurements vs. Haplotype View 
(Distribution View). This screen shows the distribution of the patients in each cell 
of the matrix of FIGURE 36. This view is an alternative way of showing 
information similar to that shown in FIGURE 19 for the CYP2D6 gene. Drop-down 
menus and buttons are as described for FIGURE 36. 

FIGURE 38. Expanded Clinical Distribution View. This 
screen shows an expanded view of one haplotype-pair distribution. This screen 
results when a user selects a cell in the matrix in FIGURE 37. The screen shows the 
number of patients in the various response bins indicated on the horizontal axis. 
This view is an alternative way of showing information similar to that shown in 
FIGURE 20 for the CYP2D6 gene, and also displays additional information. 

FIGURE 39A. DecoGen Single Gene Statistics Calculator 
(Linear Regression Analysis View). This screen shows the results of a dose-response 
linear regression calculation on each of the shown individual 
polymorphisms or subhaplotypes with respect to the clinical measure "Delta % 
FEV1 pred." The SNPs and subhaplotypes shown are those selected as significant 
in the build-up procedure described below. This view is an alternative way of 
showing information similar to that shown in FIGURE 21 for the CYP2D6 gene and 
the "test" measurement, with additional information. The numbers in the boxes next 
to "Confidence" and "Fixed Site" in FIGURE 39A are default values for these 
parameters, but can be changed by the user. After they are changed, the user must 
click the "Redo" or "Recalculate" button (the little calculator icon) to regenerate 
the statistic with the new parameters. The first two boxes hold the tight and loose 
cutoffs for the SNP-to-HAP build-up procedure referred to above. The "Fixed 
Site" value specifies how far the build-up can proceed; a value of "4" means that 
sub-haplotypes with no more than 4 non-* sites are produced. The minus sign 
indicates that the full-haplotype build-down procedure is also performed. Selecting 
the Show/Hide button allows the user 
to toggle between modes where all examined correlations are displayed and where 
only those passing the tight statistical criteria are displayed. 

FIGURE 39B. Regression for Delta %FEV1 Pred. View. 
This view shows the regression line of the response as a function of the number of 
copies of haplotype **A*****A*G**. 
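
A regression of response on the number of copies of a sub-haplotype can be sketched as below. This is a minimal illustration, assuming that "*" in a pattern matches any nucleotide at that site and that an individual carries 0, 1 or 2 copies; the function names are hypothetical, and the ordinary least-squares fit shown here is a generic technique rather than the specific calculation performed by the DecoGen application.

```python
def matches(pattern, haplotype):
    """True if the haplotype agrees with the pattern at every non-* site."""
    return len(pattern) == len(haplotype) and all(
        p == "*" or p == h for p, h in zip(pattern, haplotype)
    )


def copy_number(pattern, hap_pair):
    """Number of haplotypes in the pair (0, 1 or 2) matching the pattern."""
    return sum(matches(pattern, hap) for hap in hap_pair)


def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # raises ZeroDivisionError if all copy numbers are equal
    return slope, mean_y - slope * mean_x
```

In use, one would compute `copy_number` for each individual's haplotype pair, then pass the copy numbers and the corresponding clinical measurements to `fit_line`; a slope well away from zero suggests a dose-response relationship that would still need a significance test.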

FIGURE 40. Clinical Measurements vs. Haplotype View 
(Details). This screen gives the mean and standard deviation for each of the cells in 
FIGURE 36. This view is an alternative way of showing some of the information 
similar to that shown in FIGURE 22 for the CYP2D6 gene and the "test" 
measurement. 

FIGURE 41. Clinical Measurement ANOVA calculation. 
This screen shows the statistical significance between haplotype pair groups and 
clinical response for the Hap pairs for the ADRB2 gene. This view is an alternative 
way of showing some of the information similar to that shown in FIGURE 23 for 
the CYP2D6 gene and the "test" measurement. 

FIGURE 42. Clinical Variables View. This figure simply 
shows histogram distributions for each of the clinical variables. This is the same as 
FIGURE 38, but not selected by haplotype pair. A clinical measurement is chosen by 
selecting one of the lines in the top list. 

FIGURE 43. Clinical Correlations View. This view allows 
one to see the correlation between any pair of clinical measurements. The user 
selects one measurement from the list on the left, which becomes the x-axis, and one 
from the list on the right, which becomes the y-axis. Each point on the bottom graph 
represents one individual in the clinical cohort. 

FIGURE 44A. Genomic Repository data submodel. This is a 
preferred alternative model to the submodels shown in FIGURES 25A and 25D. 

FIGURE 44B. Clinical Repository data submodel. This is a 
preferred alternative submodel to that shown in FIGURE 25B. 

FIGURE 44C. Variation Repository data submodel. This is 
an alternative submodel to that shown in FIGURE 25C. 

FIGURE 44D. Literature Repository data submodel. This 
incorporates some of the tables from the gene repository submodel shown in 
FIGURE 25A. 

FIGURE 44E. Drug Repository data submodel. This is an 
alternative submodel to that shown in FIGURE 25E. 

FIGURE 44F. Legend of symbols in FIGURES 44A-E. 

FIGURE 45. Flowchart. This is a flow chart for a multi-SNP 
analysis method of associating phenotypes (such as clinical outcomes) with 
haplotypes (also called a "build-up" procedure). 

FIGURE 46. Flowchart. This is a flow chart for a reverse-SNP 
analysis method of associating phenotypes (such as clinical outcomes) with 
haplotypes (also called a "pare-down" procedure). 

FIGURE 47. Diagram of a process for assembling a genomic 
sequence by a human or a computer. 

FIGURE 48. Diagram of a process for generating and 
displaying a gene structure. 

FIGURE 49. Diagram of a process of generating and 
displaying a protein structure. 

VII. DETAILED DESCRIPTION OF THE INVENTION 

A. DEFINITIONS 

The following definitions are used herein: 

Allele - A particular form of a genetic locus, distinguished 
from other forms by its particular nucleotide sequence. 

Ambiguous polymorphic site - A heterozygous 
polymorphic site or a polymorphic site for which nucleotide sequence information is 
lacking. 

Candidate Gene - A gene which is hypothesized or known 
to be responsible for a disease, condition, or the response to a treatment, or to be 
correlated with one of these. 

Full Polymorphic Set - The polymorphic set whose 
members are a sequence of all the known polymorphisms. 

Full-genotype - The unphased 5' to 3' sequence of 
nucleotide pairs found at all known polymorphic sites in a locus on a pair of 
homologous chromosomes in a single individual. 

Gene - A segment of DNA that contains all the information 
for the regulated biosynthesis of an RNA product, including promoters, exons, 
introns, and other untranslated regions that control expression. 

Gene Feature - A portion of the gene such as, e.g., a single 
exon, a single intron, a particular region of the 5' or 3'-untranslated regions. The 
gene feature is always associated with a continuous DNA sequence. 

Genotype - An unphased 5' to 3' sequence of nucleotide 
pair(s) found at one or more polymorphic sites in a locus on a pair of homologous 
chromosomes in an individual. As used herein, genotype includes a full-genotype 
and/or a sub-genotype as described below. 

Genotyping - A process for determining a genotype of an 
individual. 

Haplotype - A member of a polymorphic set, e.g., a 
sequence of nucleotides found at one or more of the polymorphic sites in a locus in 
a single chromosome of an individual. (See, e.g., HAP 1 in FIGURE 4A). A full 
haplotype is a member of a full polymorphic set. A sub-haplotype is a member of a 
polymorphic subset. 

Haplotype data - Information concerning one or more of the 
following for a specific gene: a listing of the haplotype pairs in each individual in a 
population; a listing of the different haplotypes in a population; frequency of each 
haplotype in that or other populations, and any known associations between one or 
more haplotypes and a trait. 

Haplotype pair - The two haplotypes found for a locus in a 
single individual. 

Haplotyping - A process for determining one or more 
haplotypes in an individual and includes use of family pedigrees, molecular 
techniques and/or statistical inference. 

Isoform - A particular form of a gene, mRNA, cDNA or the 
protein encoded thereby, distinguished from other forms by its particular sequence 
and/or structure. 

Isogene - One of the two copies (or isoforms) of a gene 
possessed by an individual or one of all the copies (or isoforms) of the gene found in 
a population. An isogene contains all of the polymorphisms present in the particular 
copy (or isoforms) of the gene. 

Isolated - As applied to a biological molecule such as RNA, 
DNA, oligonucleotide, or protein, isolated means the molecule is substantially free 
of other biological molecules such as nucleic acids, proteins, lipids, carbohydrates, 
or other material such as cellular debris and growth media. Generally, the term 
"isolated" is not intended to refer to a complete absence of such material or to 
absence of water, buffers, or salts, unless they are present in amounts that 
substantially interfere with the methods of the present invention. 

Locus - A location on a chromosome or DNA molecule 
corresponding to a gene or a physical or phenotypic feature. 

Nucleotide pair - The nucleotides found at a polymorphic 
site on the two copies of a chromosome from an individual. 

Phased - As applied to a sequence of nucleotide pairs for two 
or more polymorphic sites in a locus, phased means the combination of nucleotides 
present at those polymorphic sites on a single copy of the locus is known. 

Polymorphic Set - A set whose members are a sequence of 
one or more polymorphisms found in a locus on a single chromosome of an 
individual. See, e.g., the set having members HAP 1 through HAP 10 in FIGURE 
4A. 

Polymorphic site - A nucleotide position within a locus at 
which the nucleotide sequence varies from a reference sequence in at least one 
individual in a population. Sequence variations can be substitutions, insertions or 
deletions of one or more bases. 

Polymorphic Subset - The polymorphic set whose members 
are fewer than all the known polymorphisms. 

Polymorphism - The sequence variation observed in an 
individual at a polymorphic site. Polymorphisms include nucleotide substitutions, 
insertions, deletions and microsatellites and may, but need not, result in detectable 
differences in gene expression or protein function. 

Polymorphism data - Information concerning one or more 
of the following for a specific gene: location of polymorphic sites; sequence 
variation at those sites; frequency of polymorphisms in one or more populations; the 
different genotypes and/or haplotypes determined for the gene; frequency of one or 
more of these genotypes and/or haplotypes in one or more populations; any known 
association(s) between a trait and a genotype or a haplotype for the gene. 

Polymorphism Database - A collection of polymorphism 
data arranged in a systematic or methodical way and capable of being individually 
accessed by electronic or other means. 

Polynucleotide - A nucleic acid molecule comprised of 
single-stranded RNA or DNA or comprised of complementary, double-stranded 
DNA. 

Reference Population - A group of subjects or individuals 
who are representative of a general population and who contain most of the genetic 
variation predicted to be seen in a more specialized population. Typically, as used 
in the present invention, the reference population represents the genetic variation in 
the population at a certainty level of at least 85%, preferably at least 90%, more 
preferably at least 95% and even more preferably at least 99%. 

Reference Repository - A collection of cells, tissue or DNA 
samples from the individuals in the reference population. 

Single Nucleotide Polymorphism (SNP) - A polymorphism 
in which a single nucleotide observed in a reference individual is replaced by a 
different single nucleotide in another individual. 

Sub-genotype - The unphased 5' to 3' sequence of 
nucleotides seen at a subset of the known polymorphic sites in a locus on a pair of 
homologous chromosomes in a single individual. 

Subject - An individual (person, animal, plant or other 
eukaryote) whose genotype(s) or haplotype(s) or response to treatment or disease 
state are to be determined. 

Treatment - A stimulus administered internally or externally 
to an individual. 

Unphased - As applied to a sequence of nucleotide pairs for 
two or more polymorphic sites in a locus, unphased means the combination of 
nucleotides present at those polymorphic sites on a single copy of the locus (i.e., 
located on a single DNA strand), is not known. 

World Population Group - Individuals who share a 
common ethnic or geographic origin. 
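
The relationship between the phased and unphased terms defined above can be made concrete with a short sketch. It is an illustration only: haplotypes are assumed to be equal-length strings with one single-character allele per polymorphic site, the function names are hypothetical, and sites with missing sequence information are not handled.

```python
from itertools import product


def unphased_genotype(hap_pair):
    """Collapse a haplotype pair into its unphased genotype.

    Each site becomes an unordered nucleotide pair, so the information
    about which allele sits on which chromosome copy is discarded.
    """
    first, second = hap_pair
    return tuple(tuple(sorted(pair)) for pair in zip(first, second))


def compatible_haplotype_pairs(genotype):
    """Enumerate every haplotype pair consistent with an unphased genotype.

    With k heterozygous sites there are 2**(k - 1) distinct pairs
    (a single pair when k is 0 or 1).
    """
    het_sites = [i for i, (x, y) in enumerate(genotype) if x != y]
    pairs = set()
    # The first heterozygous site is held fixed to avoid mirror-image pairs.
    for flips in product([False, True], repeat=max(len(het_sites) - 1, 0)):
        first = [x for x, _ in genotype]
        second = [y for _, y in genotype]
        for site, flip in zip(het_sites[1:], flips):
            if flip:
                first[site], second[site] = second[site], first[site]
        pairs.add(tuple(sorted(("".join(first), "".join(second)))))
    return pairs
```

For example, the haplotype pairs (AC, GT) and (AT, GC) give the same unphased genotype, which is why a genotype with two or more heterozygous sites does not by itself determine the haplotype pair.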

B. METHODS OF IMPLEMENTING THE INVENTION 

The present invention may be implemented with a computer, 
an example of which is shown in FIGURE 1A. The computer includes a central 
processing unit (CPU) connected by a system bus or other connecting means to a 
communication interface, system memory (RAM), non-volatile memory (ROM), 
and one or more other storage devices such as a hard disk drive, a diskette drive, and 
a CD ROM drive. The computer may also include an internal or external modem 
(not shown). The computer also includes a display device, such as a CRT monitor 
or an LCD display, and an input device, such as a keyboard, mouse, pen, 
touch-screen, or voice activation system. The computer stores and executes various 
programs such as an operating system and application programs. The computer 
may be embodied, for example, as a personal computer, work station, laptop, 
mainframe, or a personal digital assistant. The computer may also be embodied as a 
distributed multi-processor system or as a networked system such as a LAN having 
a server and client terminals. 

The present invention uses a program, referred to as the 
"DecoGen™ application", that generates views (or screens) displayed on a display 
device and with which the user can interact to accomplish a variety of tasks and 
analyses. For example, the DecoGen™ application may allow users to view and 
analyze large amounts of information such as gene-related data (e.g., gene loci, gene 
structure, gene family), population data (e.g., ethnic, geographical, and haplotype 
data for various populations), polymorphism data, genetic sequence data, and assay 
data. The DecoGen™ application is preferably written in the Java programming 
language. However, the application may be written using any conventional visual 
programming language such as C, C++, Visual Basic or Visual Pascal. The 
DecoGen™ application may be stored and executed on the computer. It may also be 
stored and executed in a distributed manner. 

The data processed by the DecoGen™ application is 
preferably stored as part of a relational database (e.g., an instance of an Oracle 
database or a set of ASCII flat files). This data can be stored on, for example, a CD 
ROM or on one or more storage devices accessible by the computer. The data may 
be stored on one or more databases in communication with the computer via a 
network. 

In one scenario, the data will be delivered to the user on any 
standard media (e.g., CD, floppy disk, tape) or can be downloaded over the internet. 
The DecoGen™ application and data may also be installed on a local machine. The 
DecoGen™ application and data will then be on the machine that the user directly 
accesses. Data can be transmitted in the form of signals. 

FIGURE 1B shows an implementation where a network 
interconnects one or more host computers with one or more user terminals. The 
communication network may, for example, include one or more local area networks 
(LANs), metropolitan area networks (MANs), wide area networks (WANs), or a 
collection of interconnected networks such as the Internet. The network may be 
wired, wireless, or some combination thereof. The host computer may, for example, 
be a world wide web server ("web server"). The user terminal may, for example, be 
a client device such as a computer as shown in FIGURE 1A. 

A web server stores information documents called pages. A 
server process listens for incoming connections from clients (e.g., browsers running 
on a client device). When a connection is established, the client sends a request and 
the server sends a reply. The request typically identifies a page by its Uniform 
Resource Locator (URL) and the reply includes the requested page. This 
client-server protocol is typically performed using the hypertext transfer protocol ("http"). 
Pages are viewed using a browser program. They are written in a language called 
hypertext markup language ("html"). A typical page includes text and formatting 
commands called tags. Pages may also include links (pointers) to other pages. 
Strings of text or images that are links to other pages are called hyperlinks. 
Hyperlinks are highlighted (e.g., by shading, color, underlining) and may be 
invoked by placing the cursor on the highlighted area and selecting it (e.g., by 
clicking the mouse button). A page may also contain a URL reference to a portion 
of multimedia data such as an image, video segment, or audio file. Pages may also 
point to a Java program called an applet. When the browser connects to where the 
applet is stored, the applet is downloaded to the client device and executed there in a 
secure manner. Pages may also contain forms that prompt a user to enter 
information or that have active maps. Data entered by a user may be handled by 
common gateway interface (CGI) programs. Such programs may, for example, 
provide web users with access to one or more databases. 
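
The request and reply exchange described above can be illustrated with a small, self-contained sketch using only the Python standard library. The page path, page content and function name are invented for illustration; the sketch starts a throwaway server on the local machine and is not part of the DecoGen application.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page store: maps the path part of a URL to an HTML page.
PAGES = {
    "/genes.html": b"<html><body><a href='/adrb2.html'>ADRB2</a></body></html>",
}


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404)  # the requested page does not exist
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(path):
    """Start a local server, request one page, and return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", path)  # the request identifies the page
        response = conn.getresponse()  # the reply carries the page
        status, body = response.status, response.read()
        conn.close()
        return status, body
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch("/genes.html")` returns status 200 together with the stored page, while an unknown path returns status 404.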

As shown in FIGURE 1B, the host computer may include a 
CPU connected by a system bus or other connecting means to a communication 
interface, system memory (RAM), non-volatile memory (ROM), and a mass storage device. 
The mass storage device may, for example, be a collection of magnetic disk drives 
in a RAID system. The mass storage device may, for example, store the 
aforementioned web pages, applets, and the like. The host computer may also 
include an input device, such as a keyboard, and a display device to allow for 
control and management by an administrator. Additionally, the host computer may 
be connected to additional devices such as printers, auxiliary monitors or other 
input/output devices. The input device and display device may also be provided on 
another computer coupled to the host computer. The host computer may be 
embodied, for example, as one or more mainframes, workstations, personal 
computers, or other specialized hardware platforms. The functionality of the host 
computer may be centralized or may be implemented as a distributed system. As 
also shown in FIGURE 1B, the host computer may communicate with one or more 
databases stored on any of a variety of hardware platforms. 

In an Internet scenario, for example involving the system of 
FIGURE 1B, the DecoGen™ application will be web-based and will be delivered as 
an applet that runs in a web browser. In this case, the data will reside on a server 
machine and will be delivered to the DecoGen™ application using a standard protocol 
(e.g., HTTP with cgi-bin). To provide extra security, the network connection could 
use a dedicated line. Furthermore, the network connection could use a secure 
protocol such as Secure Sockets Layer (SSL), configured so that the server can be 
accessed only from a specified set of IP addresses. 

In another scenario, the DecoGen™ application can be 
installed on a user machine and the data can reside on a separate server machine. 
Communication between the two machines can be handled using standard 
client-server technology. An example would be to use the TCP/IP protocol to communicate 
between the client and an Oracle server. 

It may be noted that in any of the prior scenarios, some or all 
of the data used by the DecoGen™ application could be directly imported into the 
DecoGen™ application by the user. This import could be carried out by reading files 
residing on the user's local machine, or by cutting and pasting from a user document 
into the interface of the DecoGen™ application. 

In yet a further scenario, some or all of the data or the results 
of analyses of the data could be exported from the DecoGen™ application to the 
user's local computer. This export could be carried out by saving a file to the local 
disk or by cutting and pasting to a user document. 

In the present invention, various calculations are performed to 
generate items displayed on a screen or to control items displayed on a screen. As is 
well known, some basic calculations may be performed using a database query 
language (SQL), while other computations are performed by the DecoGen™ 
application (i.e., the Java program which, as previously mentioned, may be an 
applet downloaded over the internet). 
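
As an illustration of a basic calculation delegated to SQL, the sketch below counts how many copies of each haplotype were observed in each population group. The table and column names are hypothetical and are not taken from the data model of the invention, and an in-memory SQLite database stands in for the Oracle database mentioned above.

```python
import sqlite3


def haplotype_counts(rows):
    """Count haplotype copies per population group with a SQL GROUP BY.

    `rows` is a list of (subject_id, population, hap_id) tuples, with one
    row per chromosome copy, i.e. two rows per subject.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subject_haplotype ("
        "subject_id TEXT, population TEXT, hap_id INTEGER)"
    )
    conn.executemany("INSERT INTO subject_haplotype VALUES (?, ?, ?)", rows)
    query = """
        SELECT population, hap_id, COUNT(*) AS copies
        FROM subject_haplotype
        GROUP BY population, hap_id
        ORDER BY population, hap_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Dividing each count by the total number of chromosome copies in the group would give the haplotype frequency; more involved statistics would, as the text notes, be computed by the application itself.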

C. CTS™ METHODS OF THE INVENTION 

The CTS™ embodiment of the present invention preferably 
includes the following steps: 

1. A candidate gene or genes (or other loci) predicted to be 
involved in a particular disease/condition/drug response is determined or chosen. 

2. A reference population of healthy individuals with a broad 
and representative genetic background is defined. 

3. For each member of the reference population, DNA is 
obtained. 

4. For each member of the reference population, the 
haplotypes for each of the candidate gene(s) (or other loci) are found. 

5. Population averages and statistics for each of the gene(s) 
(loci)/haplotypes in the reference population are determined. 

6. (Optional step) An optimal set of genotyping markers is 
determined. These markers allow an individual's haplotypes to be accurately 
predicted without using direct molecular haplotype analysis. The predictive 
haplotyping method relies on the haplotype distribution found for the reference 
population. 

7. A trial population of individuals with the medical 
condition of interest is recruited. 

8. Individuals in the trial population are treated using some 
protocol and their response is measured. They are also haplotyped for each of the 
candidate gene(s), either directly or using predictive haplotyping based on the 
genotype. 

9. Correlations between individual response and haplotype 
content are created for the candidate gene(s) (or other loci). From these 
correlations, a mathematical model is constructed that predicts response as a 
function of haplotype content. 

10. (Optional) Follow-up trials are designed to test and 
validate the haplotype-response mathematical model. 

11. (Optional) A diagnostic method is designed (using 
haplotyping, genotyping, physical exam, serum test, etc.) to determine those 
individuals who will or will not respond to the treatment. 
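
Steps 6 and 8 refer to predicting an individual's haplotypes from the genotype using the haplotype distribution of the reference population. One simple way this could work is sketched below; it is an assumption-laden illustration, not necessarily the predictive haplotyping method of the invention. It assumes that haplotypes pair at random (so a pair's probability is proportional to the product of the two reference frequencies), that haplotypes are equal-length strings, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement


def genotype_of(hap_a, hap_b):
    """Unphased genotype: one unordered nucleotide pair per polymorphic site."""
    return tuple(tuple(sorted(pair)) for pair in zip(hap_a, hap_b))


def predict_haplotype_pair(genotype, frequencies):
    """Pick the most probable pair of known haplotypes for a genotype.

    `frequencies` maps each haplotype seen in the reference population to
    its frequency. Returns (pair, score), or (None, 0.0) when no pair of
    known haplotypes is consistent with the genotype.
    """
    best_pair, best_score = None, 0.0
    for a, b in combinations_with_replacement(sorted(frequencies), 2):
        if genotype_of(a, b) != genotype:
            continue
        # Random-pairing assumption: heterozygous pairs can arise two ways.
        score = frequencies[a] * frequencies[b] * (1 if a == b else 2)
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair, best_score
```

When two candidate pairs have similar scores the prediction is unreliable, which is one reason an optimal set of genotyping markers (step 6) or direct molecular haplotyping may be preferred.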

These steps are now described in further detail below: 

1. A candidate gene or genes (or other loci) for the 
disease/condition is determined. 

In the CTS™ embodiment of the invention, candidate gene(s) 
(or other loci) are a subset of all genes (or other loci) that have a high probability of 
being associated with the disease of interest, or are known or suspected of 
interacting with the drug being investigated. Interacting can mean binding to the 
drug during its normal route of action, binding to the drug or one of its metabolic 
products in a secondary pathway, or modifying the drug in a metabolic process. 
Candidate genes can also code for proteins that are never in direct contact with the 
drug, but whose environment is affected by the presence of the drug. In other 
embodiments of the invention, candidate gene(s) (or other loci) may be those 
associated with some other trait, e.g., a desirable phenotypic trait. Such gene(s) (or 
other loci) may be, e.g., obtained from a human, plant, animal or other eukaryote. 
Candidate genes are identified by references to the literature or to databases, or by 
performing direct experiments. Such experiments include (1) measuring expression 
differences that result from treating model organisms, tissue cultures, or people with 
the drug; or (2) performing protein-protein binding experiments (e.g., antibody 
binding assays, yeast two-hybrid assays, phage display assays) using known candidate 
proteins to identify interacting proteins whose corresponding nucleotide (genomic 
or cDNA) sequence can be determined. 

Once the candidate gene(s) (or other loci) are identified, 
information about them is stored in a database. This information includes, for 
example, the gene name, genomic DNA sequence, intron-exon boundaries, protein 
sequence and structure, expression profiles, interacting proteins, protein function, 
and known polymorphisms in the coding and non-coding regions, to the extent 
known or of interest. This information can come from public sources (e.g., 
GenBank, OMIM (Online Mendelian Inheritance in Man, a database of polymorphisms linked 
to inherited diseases), etc.). For genes that are not fully characterized, this step would 
generally require that the characterization be done. However, this is possible using 
standard mapping, cloning and sequencing techniques. The minimum amount of 
information needed is the nucleotide sequence for important regions of the gene. 
Genomic DNA or cDNA sequences are preferably used. 
In the present invention, a person may use a user terminal to 
view a screen which allows the user to see all of the candidate genes associated with 
the disease project and to bring up further information. This screen (as well as all 
the other screens described herein) may, for example, be presented as a web page, or 
a series of web pages, from a web server. This web based use may involve a 
dedicated phone line, if desired. Alternatively, this screen may be served over the 
network from a non-web based server or may simply be generated within the user 
terminal. An example of such a screen, referred to herein as a "Pathways" or "Gene 
Collection" screen, is illustrated in FIGURE 2. 


1. Illustration Using The CYP2D6 Gene 

FIGURE 2 is an example of a screen showing the set of 
candidate genes whose polymorphisms potentially contribute to the response to a 
drug or to some other phenotype. The screen shows genes for which data is 
currently available in a database useful in the invention in green; those queued for 
processing (and for which data will appear in a database) would appear in one shade 
or color, e.g., yellow, and related but unqueued genes (those for which there is 
currently no plan to deposit data in a database) would appear in another shade or
color, e.g., white. Drugs (typically ones that interact with one or more of the genes
of interest) would be shown in a third shade or color, e.g., light blue. The user can
select a gene to examine in detail by using the mouse (or other user-input device
such as keyboard, roller ball, voice recognition, etc.) to select the corresponding
icon. In the example depicted in FIGURE 2, CYP2D6, a cytochrome P450
enzyme, is selected, as indicated by the extra black box around the CYP2D6 icon.

At the left of each screen is a menu that allows the user to navigate through different 
screens of the data. 

A preferred embodiment of the present invention relates to 
situations in which patients have differential responses to the drug because they 
possess different forms of one or more of the candidate genes (or other loci). (Here 
different forms of the candidate gene(s) mean that the patients have different 
genomic DNA sequences in the gene locus). The method does not rely on these 
differences being manifested in altered amino acids in any of the proteins expressed
by any candidate gene(s) (e.g., it includes polymorphisms that may affect the
efficiency of expression or splicing of the corresponding mRNA). All that is
required is that there is a correlation between having a particular form(s) of one or
more of the genes and a phenotypic trait (e.g., response to a drug). Examples of
salient information about the candidate genes are given in FIGURES 3-8.

FIGURE 3 is an example of a screen showing basic
information about the currently selected gene such as its name, definition, function,
organism, and length. These pieces of information typically come from GenBank or
other public data sources. The figure will typically also show the number of "gene
features" (e.g., exons, introns, promoters, 3' untranslated regions, 5' untranslated
regions, etc.) in the database, the size of the analyzed population (group of people
whose DNA has been examined for this gene), the number of haplotypes found for
this gene in this population, and some measures of polymorphism frequency. The
information is stored in a database such as the one described herein, or calculated
from information stored in such a database. Most of the information shown in later
figures is specific to this analyzed population. Theta and Pi are standard measures
of polymorphism frequency, described in Ref. 1, Chapter 2.
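
By way of illustration only, the two polymorphism measures might be computed from a set of haplotype strings as sketched below. The sketch assumes that "Theta" denotes Watterson's estimator (number of segregating sites divided by the harmonic number of the sample size minus one) and that "Pi" denotes the mean number of pairwise differences; the text itself only cites Ref. 1 for the definitions, and the function names are hypothetical.

```python
from itertools import combinations


def segregating_sites(haplotypes):
    """Count positions at which more than one nucleotide is observed."""
    return sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)


def watterson_theta(haplotypes):
    """Assumed definition: S / a_n, with a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(haplotypes) / a_n


def nucleotide_diversity(haplotypes):
    """Assumed definition: mean number of differences over all pairs."""
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        raise ValueError("at least two haplotypes are required")
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)
```

Both values are per locus rather than per site; dividing by the sequence length would give per-site values.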

FIGURES 4A and 4B are examples of screens showing the
genomic structure of the gene (generally showing the location of features of the
gene, such as promoters, exons, introns, 5' and 3' untranslated regions), as well as
haplotype information. FIGURE 4A shows the location of the features in the gene,
the location of the polymorphic sites along the gene, the nucleotides at the
polymorphic sites for each of the haplotypes, and the number of times each
haplotype was seen in the representatives of each of 4 world population groups
(CA = Caucasian, AA = African American, HL = Hispanic/Latino, AS = Asian)
included in the population analyzed for this gene. All of this data resides in a
database or is calculated from the data in a database. The top view shows the
nucleotides at the polymorphic sites, i.e., the haplotypes. The middle cartoon shows
the features of the gene. In this example the promoter is indicated by a dark shaded
(or red) rectangular box and a line with an arrow, exons are shown by a gray shaded
(or blue) rectangular box and introns are shown in white (or in yellow). When the
mouse is held over a feature, the feature turns red and the name of the feature
appears (e.g., in this case, Gene). The code in parentheses (M22245) is the
GenBank accession number for the selected feature. FIGURE 4B is the same screen
as FIGURE 4A, after the user selects the gene feature. Under the cartoon of the
features are vertical bars indicating the positions of the polymorphic sites, with one
[...] unique haplotype. The letter "d" indicates that there is a deletion. The table
at the left gives the number of haplotype copies seen in each of the standard
populations. For instance, this screen indicates that there are 10 copies of haplotype
10 in Caucasians, 2 copies in African Americans, and none in Hispanic/Latinos or
Asians, for a total of 12 copies. Note that the total number of haplotypes is twice
the number of individuals examined. At the very bottom is an expanded cartoon of
the feature. One may display data concerning a particular polymorphism by
selecting the corresponding vertical bar on the expanded cartoon. The selected bar
may be identified, e.g., by a shaded or colored circle. The data for the
polymorphism appears at the lower left of the screen. This gives the number of
copies of each nucleotide (A, C, G or T) seen in each of the world population groups.

FIGURE 5 is an example of a screen showing the actual DNA 
sequence of the genomic locus for the different haplotypes seen in the population
(i.e., the sequence of the isogenes). This view appears in a separate window when
one of the features in the Gene Structure Screen (FIGURE 4A or 4B) is selected
with the mouse or other input device. This shows an alignment between the full
DNA sequences for all of the isogenes of the CYP2D6 gene in the database. The 
polymorphic positions are highlighted. 

FIGURE 6 is an example of a screen showing the predicted
secondary structure of the mRNA transcript for each CYP2D6 isogene in the
database. The secondary structure is predicted using a detailed thermodynamic
model as implemented in the program RNAstructure (Ref. 2). This is useful
because many of the polymorphisms detected do not change the amino acid
composition of the resulting protein but still lie in the coding region of the gene.
One result of such a silent mutation could be to alter the intermediate mRNA's
structure in a way that could affect mRNA stability, or how (and if) the mRNA was
spliced, transcribed or processed by the ribosome. Such a polymorphism could keep
any of the protein from being expressed and from being available to carry out its
functions. In this screen, the user can see thumbnail views of the structures for all
of the isogenes and can see a selected one of these structures expanded on the right
hand side of the screen. Changes in this structure caused by the polymorphisms
seen in the isogenes can affect the expression into protein of the gene. The
information presented in this screen can serve as an aid to the user to detect possible
effects of these polymorphisms.

FIGURE 7 is an example of a screen showing a schematic of
the structure of the protein expressed by the gene, including important domains and
the sites of the coding polymorphisms. The user gets to this screen by selecting the
"Protein Structure" link at the left hand side of the display. This screen shows
various important motifs found in the protein, and places the polymorphic sites in
the context of these motifs. The user can get information on each motif or
polymorphism by selecting the appropriate icon for the polymorphic site. In this
example, the result of selecting the first polymorphic site (as indicated by the red
shadow behind the icon) is shown. The text at the top shows the reference
codon and amino acid (CCT, Pro) and the resulting altered codon and amino acid
(TCT, Ser). Also given are the codon frequencies in parentheses. These are
calculated by looking at 10,000 codons in a variety of human genes and calculating
how often that particular codon shows up (Ref. 3).

2. A reference population of healthy individuals with a
broad and representative genetic background is defined.

Analysis of the candidate gene(s) (or other loci) requires an 
approximate knowledge of what haplotypes exist for the candidate gene(s) (or other 
loci) and of their frequencies in the general population. To do this, a reference 
population is recruited, or cells from individuals of known ethnic origin are obtained 
from a public or private source. The population preferably covers the major 
ethnogeographic groups in the U.S., European, and Far Eastern pharmaceutical 
markets. An algorithm, such as that described below, may be used to choose a
minimum number of people in each population group. For example, if one wants to
have a q% chance of not missing a haplotype that occurs in the population at a
frequency of p%, the number of individuals (n)
who must be sampled is given by $2n = \log(1-q)/\log(1-p)$, where p and q are expressed
as fractions (the factor of 2 arises because each individual contributes two
haplotypes). For instance, if p is 0.05 (i.e., if one wants to find at least one copy of
all haplotypes found at greater than 5% frequency) and q is 0.99 (i.e., one wants to
be sure to the 99% level of confidence of finding the >5% frequency haplotypes),
then $n = 0.5 \log(0.01)/\log(0.95) \approx 45$. There is always a tradeoff between how rare a
haplotype one wants to be guaranteed to see and the cost of experimentally
determining haplotypes.
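
The sample-size relation above can be checked numerically. The following is a minimal sketch, not part of the described system; the function name is hypothetical, and rounding up to a whole individual is an assumption made here.

```python
import math


def min_reference_individuals(p, q):
    """Smallest whole n satisfying 2n >= log(1 - q) / log(1 - p).

    p: haplotype frequency (as a fraction) that should not be missed.
    q: desired probability (as a fraction) of seeing at least one copy.
    """
    if not (0.0 < p < 1.0 and 0.0 < q < 1.0):
        raise ValueError("p and q must be fractions strictly between 0 and 1")
    return math.ceil(0.5 * math.log(1.0 - q) / math.log(1.0 - p))


# The worked example from the text: p = 0.05, q = 0.99.
n = min_reference_individuals(0.05, 0.99)
```

With 2n sampled haplotypes, the chance of missing a haplotype of frequency p is (1 - p)^(2n), which is where the relation comes from.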

3. For each member of the population, DNA is obtained.

In the preferred embodiment, for each member of the
reference population (called a subject), blood samples are drawn, and, preferably,
immortalized cell lines are produced. The use of immortalized cell lines is preferred
because it is anticipated that individuals will be haplotyped repeatedly, i.e., for each
candidate gene (or other loci) in each disease project. As needed, a cell sample for a
member of the population could be taken from the repository and DNA extracted
therefrom. Genomic DNA or cDNA can be extracted using any of the standard
methods.

4. For each member of the population, the haplotypes for
each of the candidate gene(s) (or other loci) are found.

The 2 haplotypes for each of the subject's candidate gene(s)
(or other loci) are determined. The most preferred method for haplotyping the
reference population is that described in U.S. Application Serial No. 60/198,340
(inventors Stephens et al.), filed April 18, 2000, which is specifically incorporated
by reference herein. Another, less preferred embodiment for haplotyping the
reference population uses the CLASPER System™ technology (Ref. U.S. Patent
Number 5,866,404), which is a technique for direct haplotyping. Other examples of
the techniques for direct haplotyping include single molecule dilution ("SMD")
PCR (Ref. 9) and allele-specific PCR (Ref. 10). However, for the purpose of this
invention, any technique for producing the haplotype information may be used.

The information that is stored in a database, such as a
database associated with the DecoGen application exemplified herein, includes (1)
the positions of one or more, preferably two or more, most preferably all, of the sites
in the gene locus (or other loci) that are variable (i.e., polymorphic) across members
of the reference population and (2) the nucleotides found for each individual's 2
haplotypes at each of the polymorphic sites. Preferably, it also includes individual
identifiers and ethnicity or other phenotypic characteristics of each individual.

In the preferred embodiment of the invention, the haplotypes
and their frequencies are stored and displayed, preferably in the manner shown, e.g.,
in FIGURES 4A and 4B. Haplotypes and other information about each of the
members of the population being analyzed can be shown, for example, in the
manner shown in FIGURE 8. The information shown in FIGURE 8 includes a
unique identifier (FID), ethnicity, age, gender, the 2 haplotypes seen for the
individual, and values of all clinical measurements available for the individual.
(Quantitative values of clinical measures would ordinarily be seen by scrolling to the
right. However, for the subjects seen in this view, there is no clinical data, because
this is the reference population of healthy individuals.)

The haplotype data may also be presented in the context of 
the entire DNA sequence. Examples of the sequences of the isogenes, with the 
polymorphisms highlighted, are shown in FIGURE 5. 

Because an individual has 2 copies of the gene (2 isogenes),
and because these 2 copies are often different, some of the polymorphic sites will
show 2 different nucleotides in a genotype, one from each of the isogenes. A
genotype from an individual with haplotypes TAC and CAG would be
(T/C),A,(C/G). This is consistent with the haplotype pairs TAC/CAG or TAG/CAC.
The fact that we do not know which haplotypes gave rise to this genotype leads us
to call this an "unphased genotype". If we haplotype this individual we then
determine the "phased genotype", which describes which particular nucleotides go
together in the haplotypes. Phasing is the description of which nucleotide at one
polymorphic site occurs with which nucleotides at other sites. This information is
left ambiguous (i.e., unphased) in a genotyping measurement but is resolved (i.e.,
phased) in a haplotype measurement.
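
The ambiguity of an unphased genotype can be made concrete with a short sketch that lists every haplotype pair consistent with a given genotype. This is an illustration only; the representation of a genotype as a list of nucleotide sets and the function names are assumptions introduced here.

```python
from itertools import product


def unphased_genotype(hap1, hap2):
    """Collapse two haplotypes into per-site sets of observed nucleotides."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same polymorphic sites")
    return [frozenset((a, b)) for a, b in zip(hap1, hap2)]


def consistent_haplotype_pairs(genotype):
    """Enumerate all unordered haplotype pairs that yield this genotype."""
    choices = [sorted(site) for site in genotype]  # 1 or 2 nucleotides per site
    pairs = set()
    for first in product(*choices):
        second = []
        for allele, site in zip(first, choices):
            # At a heterozygous site the partner carries the other nucleotide.
            other = [x for x in site if x != allele]
            second.append(other[0] if other else allele)
        pairs.add(tuple(sorted(("".join(first), "".join(second)))))
    return sorted(pairs)


pairs = consistent_haplotype_pairs(unphased_genotype("TAC", "CAG"))
```

For k heterozygous sites there are 2^(k-1) consistent pairs, so a genotype with at most one heterozygous site is already phased.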

FIGURE 9 is an example of a screen showing the genotype to
haplotype resolution for each of the individuals in the population being examined.
At the left of the screen is a shaded (or color) matrix showing the genotype
information at each of the polymorphic sites for each individual (sites across the top,
individuals going down the page). The most and least common nucleotide at each
site is defined by looking at both haplotypes of all individuals in the population at
that particular site. The nucleotide that shows up most often is called the most
common nucleotide. The one that shows up less often is termed the least common.
In situations where more than 2 nucleotides are seen at a site (which is rare but not
unknown in human genes) all nucleotides except the most common one are lumped
together in the least common category. At the right is a shaded (or color) matrix
showing the haplotype resolution. In the genotype view, a blue square indicates that
the individual is homozygous for the most common nucleotide at that site. A yellow
square indicates that the individual is homozygous for the least common base, and a
red square indicates that the individual is heterozygous at the site. On the right hand
side, a row for an individual is broken into a top and a bottom half, each
representing one of the two haplotypes. The color scheme is the same as on the left
except that all of the heterozygous sites have been resolved. The + and - buttons are
for zooming in and out.
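
The color coding of the genotype view may be sketched as follows. This is an illustrative reading of the description, not the DecoGen implementation; the function names are hypothetical, ties for the most common nucleotide are broken by first occurrence, and an individual carrying two different less-common nucleotides at a site is treated here as heterozygous.

```python
from collections import Counter


def most_common_nucleotides(haplotype_pairs):
    """Most common nucleotide per site, pooling both haplotypes of everyone."""
    all_haps = [hap for pair in haplotype_pairs for hap in pair]
    return [Counter(column).most_common(1)[0][0] for column in zip(*all_haps)]


def genotype_colors(hap1, hap2, major):
    """Blue: homozygous most common; yellow: homozygous least common; red: heterozygous."""
    colors = []
    for a, b, m in zip(hap1, hap2, major):
        if a != b:
            colors.append("red")
        elif a == m:
            colors.append("blue")
        else:
            colors.append("yellow")
    return colors


population = [("TAC", "CAG"), ("TAC", "TAC"), ("CAG", "CAG"), ("TAC", "TAC")]
major = most_common_nucleotides(population)
rows = [genotype_colors(h1, h2, major) for h1, h2 in population]
```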

Unrelated individuals who are heterozygous at more than 1
site cannot be haplotyped without (1) using a direct molecular haplotyping method
such as CLASPER System™ technology or (2) making use of knowledge of
haplotype frequencies in the population, as described below or, preferably, as
described in U.S. Application Serial No. 60/198,340 (inventors Stephens et al.),
filed April 18, 2000.

5. Population averages and statistics for each of the
haplotypes in the reference population are determined.

Once the individual haplotypes of the reference population
have been determined, the population statistics may be calculated and displayed in a
manner exemplified herein in FIGURE 10. FIGURE 10 is an example of one of
several screens showing information about the pair of haplotypes for the candidate
gene(s) (or other loci) found in an individual. In this screen, each cell of the matrix
displays some information about the group of people who were found to have the
haplotypes corresponding to the particular row and column. In all of these screens,
subjects can be grouped together by pairs of haplotypes or sub-haplotypes, where a
sub-haplotype is made up of a subset of the total group of polymorphic sites. For
example, at the top of the screen in the figure are checkboxes allowing the user to
select the subset of polymorphic sites to be examined (here sites 2 and 8 are
chosen). The + and - buttons are for zooming in and out, which increases and
decreases the viewing size of the matrix. The "Recalculate" button causes the
statistics for the groups to be recalculated after a new subset of polymorphic sites
has been selected. At the bottom is the matrix. The selected cell (outlined in green
in this figure) displays information about subjects who are homozygous for C and G
at sites 2 and 8. The text to the right gives summary numerical information about
the subjects in that box. In particular, this screen shows the distribution of subjects
in the different ethnogeographic groups with each of the haplotype pairs. In this
example, 23 subjects (18 Caucasians and 5 Asians) were found to be homozygous
for C and G at sites 2 and 8. In this example, the heights of the bars are normalized
individually for each cell so that it is not possible in this example to see relative
numbers of individuals cell to cell by looking at the heights. An alternative
normalization (in which there is a consistent normalization for all boxes) is also
possible. More detailed information is available by selecting the "View Details"
button at the top (see FIGURE 11).

FIGURE 11 is a more detailed view of the information that is
available from the summary view shown in FIGURE 10. At the bottom, one row is
shown for each haplotype pair found in the population being analyzed. Each row
shows the corresponding 2 sub-haplotypes, the total number of individuals found
with that sub-haplotype and the fraction of the total population represented by this
number. Next to these are 3 columns for each ethnogeographic group. The first
gives the number of individuals in that ethnogeographic group with that haplotype
pair. The second gives the fraction of individuals (found in a database of the present
invention) in that world population group who have that haplotype pair. The third
column gives the expected number based on Hardy-Weinberg equilibrium.
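
The grouping of subjects by sub-haplotype pair and ethnogeographic group can be sketched as below. The record layout (group code, first haplotype, second haplotype), the 1-based site numbering and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def sub_haplotype(haplotype, sites):
    """Nucleotides at the selected polymorphic sites (1-based positions)."""
    return "".join(haplotype[s - 1] for s in sites)


def tabulate_pairs(subjects, sites):
    """Count subjects per (sub-haplotype pair, ethnogeographic group).

    subjects: iterable of (group, hap1, hap2) records.
    Returns {pair: Counter({group: count})} with each pair stored in sorted order.
    """
    table = defaultdict(Counter)
    for group, hap1, hap2 in subjects:
        pair = tuple(sorted((sub_haplotype(hap1, sites), sub_haplotype(hap2, sites))))
        table[pair][group] += 1
    return table


subjects = [("CA", "ACG", "ACG"), ("AS", "ACG", "TCG"), ("CA", "TTA", "ACG")]
table = tabulate_pairs(subjects, sites=[2, 3])
```

Choosing a different `sites` list corresponds to ticking a different set of checkboxes and pressing "Recalculate".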

The observed haplotype pair frequencies in the population, in
particular the reference population, are preferably corrected for finite-size samples.
This is preferably done when the data is being used for predictive genotyping. If it
is assumed that each of the major population groups will be in Hardy-Weinberg
equilibrium, this allows one to estimate the underlying frequencies for haplotype
pairs in the reference population that are not directly observed. It is necessary to
have good estimates of the haplotype-pair frequencies in the reference population in
order to predict subjects' haplotypes from indirect measurements that will be used in
a diagnostic context (see item 6). Preferably the reference population has been
chosen to be representative of the population as a whole so that any haplotypes seen
in a clinical population have already been seen in the reference population.
Furthermore, it would be possible to determine whether certain haplotypes are
enriched in the patient population relative to the reference population. This would
indicate that those haplotypes are causative of or correlated with the disease state.

Hardy-Weinberg equilibrium (Ref. 1, Chapter 3) postulates
that the frequency of finding the haplotype pair $H_1/H_2$ is equal to
$p_{\text{H-W}}(H_1/H_2) = 2\,p(H_1)\,p(H_2)$ if $H_1 \neq H_2$ and $p_{\text{H-W}}(H_1/H_2) = p(H_1)\,p(H_2)$ if
$H_1 = H_2$. Here, $p(H_i)$ (where i = 1 or 2) is the probability of finding the haplotype $H_i$
in the population, regardless of whatever other haplotype it occurs with.
Hardy-Weinberg equilibrium usually holds in a distinct ethnogeographic group unless there
is significant inbreeding or there is a strong selective pressure on a gene. Actual
observed population frequencies $p_{\text{obs}}(H_1/H_2)$ and the corresponding
Hardy-Weinberg predicted frequencies $p_{\text{H-W}}(H_1/H_2)$ are shown in FIGURE 11,
discussed above.

If large deviations from Hardy-Weinberg equilibrium are
observed in the reference population, the number of individuals can be increased to
see if this is a sampling bias. If it is not, then it may be assumed that the haplotype
is either historically recent or is under selection pressure. A statistical test may be
used, e.g., testing whether $|p_{\text{obs}} - p_{\text{H-W}}| >$ [...]. If so, the variation is large.
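
A comparison of observed and Hardy-Weinberg predicted haplotype-pair frequencies might be computed as in the following sketch. It assumes haplotype frequencies are estimated by simple counting over all 2n haplotypes of the sample (no finite-size correction); the function names are hypothetical.

```python
from collections import Counter


def haplotype_frequencies(pairs):
    """Estimate p(H) by counting every haplotype copy in the sample."""
    counts = Counter(hap for pair in pairs for hap in pair)
    total = sum(counts.values())
    return {hap: c / total for hap, c in counts.items()}


def hardy_weinberg_frequency(h1, h2, freq):
    """p(H1)p(H2) for identical haplotypes, 2 p(H1)p(H2) otherwise."""
    product = freq.get(h1, 0.0) * freq.get(h2, 0.0)
    return product if h1 == h2 else 2.0 * product


def observed_vs_expected(pairs):
    """Return {pair: (observed frequency, Hardy-Weinberg frequency)}."""
    freq = haplotype_frequencies(pairs)
    observed = Counter(tuple(sorted(pair)) for pair in pairs)
    n = len(pairs)
    return {
        pair: (count / n, hardy_weinberg_frequency(pair[0], pair[1], freq))
        for pair, count in observed.items()
    }


sample = [("TAC", "TAC"), ("TAC", "CAG"), ("CAG", "TAC"), ("CAG", "CAG")]
result = observed_vs_expected(sample)
```

Multiplying the predicted frequency by the number of individuals in a group gives the expected count of the kind listed in the third column of FIGURE 11.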

6. (Optional — this step can be skipped if direct 
molecular haplotyping will be used on all clinical samples.) An optimal set of 
genotyping markers is determined. These markers often allow an individual's 
haplotypes to be accurately predicted without using full haplotype analysis. This 
genotyping method relies on the haplotype distribution found directly from the 
reference population. 

One of several methods to test subjects for the existence of a 
given pair of haplotypes in an individual can be used. These methods can include 
finding surrogate physical exam measurements that are found to correlate with 
haplotype pair; serum measurements (e.g., protein tests, antibody tests, and small 
molecule tests) that correlate with haplotype pair; or DNA-based tests that correlate 
with haplotype pair. An example that is used herein is to predict haplotype pair 
based on an (unphased) genotype at one or more of the polymorphic sites using an 
algorithm such as the one described further below. 

For example, as discussed above, in the case where the two
haplotypes are TAC and GAT, the genotyping information would only provide the
information that the subject is heterozygous T/G at site 1, homozygous A at site 2
and heterozygous C/T at site 3. This genotype is consistent with the following
haplotype pairs: TAC/GAT (the correct one) and GAC/TAT (the incorrect one).
Assuming that the underlying probability (as measured in the reference population)
for TAC/GAT is p% and for GAC/TAT is q%, subjects may be randomly assigned
to the first group with a probability p/(p+q) and to the second group with a
probability q/(p+q). If p >> q, then subjects will almost always be assigned
to the correct haplotype pair group if they are TAC/GAT, but the GAC/TAT
individuals will almost always be misclassified. However, the majority of individuals will
be assigned to the correct haplotype-pair group. In the case that q = 0, the correct
assignment will always be made. For cases where p ≈ q, this classification gives very
low accuracy predictions, so other methods to resolve the subjects' haplotypes must
be resorted to. One can always directly find the correct haplotypes using CLASPER
System™ technology or another direct molecular haplotyping method.
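
The frequency-weighted assignment described above, and the accuracy it implies, can be sketched as follows. The function names and the dictionary representation of candidate pairs are assumptions; the expected-accuracy formula (sum of squared assignment probabilities) follows from the random-assignment rule and is added here for illustration.

```python
import random


def assignment_probabilities(candidate_freqs):
    """Normalize reference-population frequencies of the consistent pairs."""
    total = sum(candidate_freqs.values())
    if total <= 0:
        raise ValueError("at least one candidate pair must have nonzero frequency")
    return {pair: f / total for pair, f in candidate_freqs.items()}


def assign_haplotype_pair(candidate_freqs, rng=random):
    """Randomly assign a pair with probability p/(p+q), q/(p+q), ..."""
    probs = assignment_probabilities(candidate_freqs)
    pairs = list(probs)
    return rng.choices(pairs, weights=[probs[p] for p in pairs], k=1)[0]


def expected_accuracy(candidate_freqs):
    """A subject of a pair with weight w is correctly assigned with probability w."""
    return sum(w * w for w in assignment_probabilities(candidate_freqs).values())


freqs = {"TAC/GAT": 0.09, "GAC/TAT": 0.01}  # illustrative values for p and q
accuracy = expected_accuracy(freqs)
```

With p >> q the accuracy approaches 1, with q = 0 it equals 1, and with p = q it falls to one half, matching the three cases discussed in the text.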

The ability to use genotypes to predict haplotypes is based on 
the concept of linkage. Two sites in a gene are linked if the nucleotide found at the 
first site tends to be correlated with the nucleotide found at the second site. Linkage 
calculations start with the linkage matrix, which gives the probabilities of finding 
the different combinations of nucleotides at the two sites. For instance, the 
following matrix connects 2 sites, one of which can have nucleotide A or T and the 
other of which can have nucleotide G or C. The fraction of individuals in the
population with A at site 1 and G at site 2 is 0.15.

             A       T
      G     0.15    0.40
      C     0.40    0.05



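In general, the matrix is given by

                           Site 1 -      Site 1 -
                           Allele 1      Allele 2
      Site 2 - Allele 1    $p_{11}$      $p_{12}$      $p_{1*}$
      Site 2 - Allele 2    $p_{21}$      $p_{22}$      $p_{2*}$
                           $p_{*1}$      $p_{*2}$

The values $p_{1*}$ and $p_{2*}$ are the sums over the respective rows,
while the values $p_{*1}$ and $p_{*2}$ are the sums over the respective columns. By
definition, $p_{1*} + p_{2*} = p_{*1} + p_{*2} = 1$. Three standard measures of linkage
disequilibrium that are used are (Ref. 1, Chapter 3):

$$D = p_{11}\,p_{22} - p_{12}\,p_{21} \qquad (1)$$

$$\Delta = \frac{D}{\left(p_{1*}\,p_{2*}\,p_{*1}\,p_{*2}\right)^{1/2}} \qquad (2)$$

$$D' = \begin{cases} \dfrac{D}{\min\left(p_{1*}\,p_{*2},\; p_{2*}\,p_{*1}\right)}, & D > 0 \\[2ex] \dfrac{D}{\min\left(p_{1*}\,p_{*1},\; p_{2*}\,p_{*2}\right)}, & D < 0 \end{cases} \qquad (3)$$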
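
As a numerical illustration, the three linkage disequilibrium measures may be evaluated for a two-site matrix such as the example with entries 0.15, 0.40, 0.40 and 0.05. The sketch below assumes the standard textbook definitions (D as the difference of the diagonal products, the correlation-type measure normalized by the square root of the product of the marginals, and Lewontin's D' normalized by the largest attainable |D| for the given marginals); the function name is hypothetical.

```python
import math


def linkage_measures(p11, p12, p21, p22):
    """Return (D, Delta, D') for a 2x2 matrix of two-site frequencies.

    Rows index the allele at one site and columns the allele at the other.
    A monomorphic site (a zero marginal) raises ZeroDivisionError.
    """
    row1, row2 = p11 + p12, p21 + p22
    col1, col2 = p11 + p21, p12 + p22
    d = p11 * p22 - p12 * p21
    delta = d / math.sqrt(row1 * row2 * col1 * col2)
    if d > 0:
        d_max = min(row1 * col2, row2 * col1)
    elif d < 0:
        d_max = min(row1 * col1, row2 * col2)
    else:
        return d, delta, 0.0
    return d, delta, d / d_max


# Example matrix from the text: 0.15 and 0.40 in the G row, 0.40 and 0.05 in the C row.
d, delta, d_prime = linkage_measures(0.15, 0.40, 0.40, 0.05)
```

In this sketch D' keeps the sign of D; a display such as a color scale might use its absolute value instead.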



FIGURE 12 is an example of a screen showing a measure of
the linkage between different polymorphic sites in the gene. Measures of linkage
tell how well we can predict the nucleotide at one polymorphic site given the
nucleotide at another site. A high value of the linkage measure indicates a high
level of predictive ability. This screen shows D'. The color of the square in the
display at the intersection of sites α and β indicates the value of the linkage measure.
Red indicates strong linkage and blue indicates weak to non-existent linkage. White
squares in a row indicate that the corresponding polymorphic site has no variation in
the population being examined. Such sites are included because there is information
about the presence of polymorphisms other than that provided by our haplotype
analysis. This would be the case if a polymorphism was reported in the literature
which we were not able to detect in our population. The values to the right of the
matrix give $I_{HAP}$ for each of the sites. $I_{HAP}$ is a measure of the information content
of the single site and is given by

[Equation (4), defining $I_{HAP}$; not legible in the source]

where $N_{HAP}$ is the number of distinct haplotypes observed,
$P(j)$ is the probability of finding haplotype j, and $P(j|i)$ is the conditional
probability of finding haplotype j with nucleotide i. (The conditional probability
$P(j|i)$ is the probability of finding haplotype j in the subset of all observations
where nucleotide i is seen.) High values of $I_{HAP}$ (~2.0) indicate that at least some
pairs of observed haplotypes can be distinguished by looking at that single site.
Small values (~1.0) indicate that the particular site is not informative for
distinguishing any pair of haplotypes. This same method can be used for
sub-haplotypes. These values are useful for choosing sites for genotyping, as described
above. The + and - boxes are for zooming in and out.

FIGURES 13, 14, and 15 show views of a tool for performing
an analysis of which polymorphic sites may be genotyped in order to determine an
individual's haplotypes by the method of predictive haplotyping, rather than using
more expensive direct haplotyping methods, such as the CLASPER System™
method of haplotyping. In these screens, one chooses a subset of polymorphic sites
of interest (the entire haplotype or a sub-haplotype can be examined) and then a
subset of sites at which the subject is to be genotyped. The colors in the
haplotype-pair boxes then indicate the fraction of individuals in that box who are correctly
haplotyped based on the statistical model described in the previous paragraph.
FIGURE 14 gives the predicted values and FIGURE 15 shows a tool for directly
finding the optimal set of genotyping sites.

The purpose of the three screens in FIGURES 13, 14 and 15 is
to provide an example of the tools to find the simplest genotyping experiment that
could detect an individual's haplotypes. The basic layout of the screen in FIGURE
13 is the same as described in FIGURE 10. The top row of checkboxes is used to
choose the haplotype or sub-haplotype which is desired to be determined. There is one other
row of checkboxes beneath those for choosing the haplotype or sub-haplotype. This
second row, labeled "Genotype Loci", allows the user to select a subset of positions
at which to genotype. The color of the square in the matrix indicates the fraction of
individuals who are actually in that category who would be correctly categorized
using this sub-genotype. For example, this screen shows that individuals
homozygous for TGG at positions 2, 3, and 8 would be correctly haplotyped by
genotyping at positions 2 and 8. Selection of optimal genotyping sites is aided by
information from the Linkage View (FIGURE 12). Typically one will only need to
genotype one site of a pair of polymorphic sites that are in strong linkage.

The screen in FIGURE 14 gives a numerical view of the data
shown in FIGURE 13. One can see that if one genotypes at sites 2 and 8, one could
assign individuals to the TGG/TGG group with 100% confidence (based on the data
obtained for the reference population). However, one would have low confidence in
the ability to assign individuals to the CAG/CGG group.
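
The per-group confidence values of the kind shown in FIGURE 14 might be computed as in the sketch below. It assumes the frequency-weighted assignment model: among reference individuals sharing the same genotype at the chosen sites, the confidence for a haplotype pair is the share of those individuals who carry that pair. The function names, the toy three-site data and the 1-based site numbering are illustrative assumptions.

```python
from collections import Counter


def sub_pair(hap1, hap2, sites):
    """Unordered pair of sub-haplotypes at the target sites (1-based)."""
    a = "".join(hap1[s - 1] for s in sites)
    b = "".join(hap2[s - 1] for s in sites)
    return tuple(sorted((a, b)))


def sub_genotype(hap1, hap2, sites):
    """Unphased genotype at the genotyped sites (1-based)."""
    return tuple(frozenset((hap1[s - 1], hap2[s - 1])) for s in sites)


def predictive_ability(reference, target_sites, genotype_sites):
    """Fraction of each haplotype-pair group that would be correctly assigned."""
    if not set(genotype_sites) <= set(target_sites):
        raise ValueError("genotype sites must be a subset of the target sites")
    pair_counts, genotype_counts, pair_genotype = Counter(), Counter(), {}
    for hap1, hap2 in reference:
        pair = sub_pair(hap1, hap2, target_sites)
        geno = sub_genotype(hap1, hap2, genotype_sites)
        pair_counts[pair] += 1
        genotype_counts[geno] += 1
        pair_genotype[pair] = geno  # a pair always maps to a single genotype
    return {p: n / genotype_counts[pair_genotype[p]] for p, n in pair_counts.items()}


reference = [("TGG", "TGG")] * 3 + [("CAG", "CGG")] + [("CGG", "CGG")] * 3
ability = predictive_ability(reference, target_sites=[1, 2, 3], genotype_sites=[1, 3])
```

In this toy data the TGG/TGG group is resolved completely by the two genotyped sites, while CAG/CGG is mostly confused with CGG/CGG, mirroring the qualitative pattern described for FIGURE 14.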

FIGURE 15 is an example of a screen showing the results of
a tool for directly finding the optimal genotyping sites. This screen gives the results
of a simple optimization approach to finding the simplest genotyping approach for
predicting an individual's haplotypes. For each haplotype pair, the predictive
abilities of all single-site genotyping experiments are calculated. If any of these has
a predictive ability of greater than some cutoff (say 90%), then that single-site
genotype test is shown. A single-site genotype test is one in which an individual's
nucleotide(s) is found at that single site. This can be done using any of several
standard methods including DNA sequencing, single-base extension, allele-specific
PCR, or TOF-mass spec. (In the figure, a red box indicates that individuals should
be genotyped at that site, and a white box indicates that the individual should not be
genotyped there.) If no single-site test has a predictive ability of greater than the
cutoff, then the calculated predictive abilities of all 2-site genotyping tests are
examined by the computer program. The first 2-site test whose predictive ability
exceeds the cutoff is then displayed. If no 2-site test is successful, then the
predictive abilities of all 3-site tests are examined by the computer program, and so
on. The mask at the right hand side of this display shows the first test found that
exceeded the cutoff value.
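
The search order described above (all single-site tests, then all 2-site tests, and so on, stopping at the first test above the cutoff) can be sketched as follows. This is an illustrative brute-force version under the frequency-weighted assignment model, with hypothetical function names and 0-based site indices; it is not the improved method referred to elsewhere in this document.

```python
from itertools import combinations


def _genotype(pair, sites):
    """Unphased genotype of a haplotype pair at the given 0-based sites."""
    h1, h2 = pair
    return tuple(frozenset((h1[s], h2[s])) for s in sites)


def predictive_ability(reference, target, sites):
    """Share of reference individuals with the target's genotype who carry the target pair."""
    key = _genotype(target, sites)
    same = [pair for pair in reference if _genotype(pair, sites) == key]
    hits = sum(1 for pair in same if pair == target)
    return hits / len(same) if same else 0.0


def simplest_genotyping_test(reference, target, cutoff=0.9):
    """First smallest site subset whose predictive ability exceeds the cutoff."""
    reference = [tuple(sorted(pair)) for pair in reference]
    target = tuple(sorted(target))
    n_sites = len(target[0])
    for k in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), k):
            if predictive_ability(reference, target, sites) > cutoff:
                return sites
    return None  # no subset of sites reaches the cutoff


reference = [("TGG", "TGG")] * 3 + [("CAG", "CGG")] + [("CGG", "CGG")] * 3
```

The number of subsets grows combinatorially with the number of polymorphic sites, so this exhaustive ordering is only practical for a modest number of sites.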

An improved method for finding optimal genotyping sites is
described in section D, below.

FIGURES 16 and 17 are examples of screens demonstrating
another tool for analyzing linkage. This tool is a minimal spanning network which
shows the relatedness of the haplotypes seen in the population (Ref. 8). Haplotypes
are amenable to modes of analysis that are not available for isolated variants (e.g.,
SNPs). In particular, a sample of haplotypes reflects the actual phylogenetic history
of the genetic locus. This history includes the divergence patterns among the
haplotypes, the order of mutational and recombinational events, and a better
understanding of the actual variation among the different populations comprising
the sample. These considerations are important in the assessment of a locus's
involvement in a particular phenotype (e.g., differential response to a drug or
adverse side effects). The phylogenetic algorithms included in the DecoGen™
application are both exploratory and analytical tools, in that they allow
consideration of partial haplotypes as well as those based on the full set of
haplotypes in the context of clinical data. The checkboxes and recalculate button
shown in FIGURES 16 and 17 serve the purpose of selecting sub-haplotypes as
described under FIGURE 10. The results of the calculations are shown in real time,
i.e., the sizes and positions of the balls, as well as the length of the lines, change as
the calculation progresses. Here a circle represents a haplotype. The distance
between haplotypes is a rough measure of the number of nucleotides that would
have to be flipped to change one haplotype into the other. Pairs of haplotypes
separated by one nucleotide flip are connected with black lines. Pairs separated by
2 flips are connected with light blue lines. The size of the haplotype ball increases
with the frequency of that haplotype in the population. Each haplotype or
sub-haplotype ball is labeled with the relevant nucleotide string. The user can toggle the
labels off and on by selecting the haplotype ball, e.g., with a mouse. The + and -
boxes are for zooming in and out. The "View Hap Pairs" box serves the purpose of
showing the pairing information for haplotypes. The lines shown in this figure are
replaced with lines connecting pairs of haplotypes seen in each individual. The
colors in the balls, and the pie shaped pieces, represent the fraction of that haplotype
found in the major ethnogeographic group. Red represents Caucasian, blue
African-American, light blue Asian, and green Hispanic/Latino. The Minimum Size checkbox
allows the user to select sub-haplotypes as in earlier Figures (see FIGURE 10).
This aspect of the invention relates to a graphical display of
the haplotypes (including sub-haplotypes) of a gene grouped according to their
evolutionary relatedness. As used herein, "evolutionary relatedness" of two
haplotypes is measured by how many nucleotides have to be flipped in one of the
haplotypes to produce the other haplotype.
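
By way of illustration only, the flip count that defines evolutionary relatedness may be sketched in Python as follows. The function names and the example sub-haplotype strings are hypothetical and are not taken from the DecoGen application; the line colors are those described for FIGURES 16 and 17.

```python
def flips(hap_a, hap_b):
    """Number of polymorphic sites at which two haplotype strings differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same polymorphic sites")
    return sum(1 for a, b in zip(hap_a, hap_b) if a != b)


def edge_style(n_flips):
    """Line color for a pair of haplotypes; None means no line is drawn."""
    return {1: "black", 2: "light blue"}.get(n_flips)


# Hypothetical sub-haplotypes, used only to exercise the functions.
haps = ["CAGT", "CAGC", "TAGC"]
for i, a in enumerate(haps):
    for b in haps[i + 1:]:
        print(a, b, flips(a, b), edge_style(flips(a, b)))
```

The count is symmetric, so the same value is obtained whichever of the two haplotypes is taken as the starting point.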

In one embodiment, the display is a minimal spanning 
network in which a haplotype is represented by a symbol such as a circle, square,
triangle, star and the like. Symbols representing different haplotypes of a gene may 
be visually distinguished from each other by being labeled with the haplotype and/or 
may have different colors, different shading tones, cross-hatch patterns and the like. 
Any two haplotype symbols are separated from each other by a distance, referred to 
as the ideal distance, that is proportional to the evolutionary relatedness between 
their represented haplotypes. For example, if displaying a group of haplotypes 
related by one, two or three nucleotide flips, the proportional distances between the 
haplotype symbols could be one inch, two inches, and three inches, respectively. 
The haplotype symbols may be connected by lines, which may have different 
appearances, i.e., different colors, solid vs. dotted vs. dashed, and the like, to help 
visually distinguish between one nucleotide flip, two nucleotide flips, three 
nucleotide flips, etc. 

In a preferred embodiment, the method is implemented by a
computer and the graphical display is produced by an algorithm that connects
haplotype symbols by springs whose equilibrium distance is proportional to the
ideal distance. Preferably, the size of a particular haplotype symbol is proportional
to the frequency of that haplotype in the population. In addition, the haplotype
symbol may be divided into regions representing different characteristics possessed
by members of the population, such as ethnicity, sex, age, or differences in a
phenotype such as height, weight, drug response, disease susceptibility and the like.
The different regions in a haplotype symbol may be represented by different colors,
shading tones, stippling, etc. In a particularly preferred embodiment, generation of
the graphical display is shown in real time, i.e., the positions and sizes of haplotype
symbols, as well as the lengths of their connecting springs, change as the
algorithm-directed organization of the haplotypes of a particular gene proceeds.
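
One possible way to realize such a spring-based arrangement is sketched below. This is a minimal illustration, not the algorithm of the DecoGen application: the function name, the step count, the rate constant and the simple Hooke's-law update are all assumptions. Each symbol is moved a small step along the net spring force until the pairwise distances approach the ideal distances.

```python
import math
import random


def spring_layout(ideal, steps=500, rate=0.05, seed=0):
    """Relax 2-D positions so pairwise distances approach ideal[i][j]."""
    rng = random.Random(seed)
    n = len(ideal)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                # Spring force: pull together if too far, push apart if too close.
                stretch = dist - ideal[i][j]
                fx += stretch * dx / dist
                fy += stretch * dy / dist
            pos[i][0] += rate * fx
            pos[i][1] += rate * fy
    return pos


# Two haplotypes separated by two nucleotide flips (ideal distance 2).
layout = spring_layout([[0, 2], [2, 0]])
print(layout)
```

Redrawing the symbols after each pass of the outer loop would give the real-time effect described above, in which positions change as the calculation progresses.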
The resulting display provides a visual impression of the
phylogenetic history of the locus, including the divergence patterns among the
haplotypes for that locus, as well as providing a better understanding of the actual
variation among the different populations comprising the sample. These
considerations are important in the assessment of the encoded protein's involvement
in a particular phenotype (e.g., differential response to a drug or adverse side
effects). In addition, a spanning network generated for haplotypes in a clinical
population using the same algorithm may be superimposed on the spanning network 
for the reference population to analyze whether the haplotype content of the clinical 
population is representative of the reference population. 

7. A trial population of individuals who suffer from the 
condition of interest is recruited. 

The end result of the CTS method is the correlation of an 
underlying genetic makeup (in the form of haplotype or sub-haplotype pairs for one 
or more genes or other loci) and a treatment outcome. In order to deduce this 
correlation it is necessary to run a clinical trial or to analyze the results of a clinical 
trial that has already been run. Individuals who suffer from the condition of interest 
are recruited. Standard methods may be used to define the patient population and to
enroll subjects. 

Individuals in the trial population are optionally graded for
the existence of the underlying cause (disease/condition) of interest. This step will
be important in cases where the symptom being presented by the patients can arise
from more than one underlying cause, and where the treatments for the underlying
causes are not the same. An example of this would be where patients experience breathing
difficulties that are due to either asthma or respiratory infections. If both sets were
included in a trial of an asthma medication, there would be a spurious group of
apparent non-responders who did not actually have asthma. These people would
degrade any correlation between haplotype and treatment outcome.

This grading of potential patients could employ a standard 
physical exam or one or more lab tests. It could also use haplotyping for situations 
where there was a strong correlation between haplotype pair and disease 
susceptibility or severity. 
8. Individuals in the trial population are treated using
some protocol and their response is measured. In addition, they are haplotyped,
either directly or using predictive genotyping.

This step is straightforward. If patients are to be directly haplotyped
for the candidate genes, a direct molecular haplotyping method could be used. If
they are to be indirectly haplotyped, a method such as the one described above in
item 6 could be used. Clinical outcomes in response to the treatment are measured
using standard protocols set up for the clinical trial. 

9. Correlations between individual response and 
haplotype content are created for the candidate genes. From these correlations, a 
mathematical model is constructed that predicts response as a function of haplotype 
content. 

Correlations may be produced in several ways. In one
method, averages and standard deviations for the haplotype-pair groups may be
calculated. This can also be done for sub-haplotype-pair groups. These can be
displayed in a color coded manner with low responding groups being colored one
way and high responding groups colored another way (see, e.g., FIGURE 18).
Distributions in the form of bar graphs can also be displayed (see, e.g., FIGURE
19), as can all group means and standard deviations (see, e.g., FIGURE 20).
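
The first of these methods, calculating an average and standard deviation for each haplotype-pair group, might be sketched as follows. The function name, the tuple layout of the patient records and the example values are hypothetical; the pair is stored in sorted order on the assumption that the two haplotypes of an individual are unordered.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_stats(patients):
    """patients: iterable of (hap1, hap2, response).

    Returns {haplotype pair: (count, mean response, standard deviation)}.
    """
    groups = defaultdict(list)
    for hap1, hap2, response in patients:
        groups[tuple(sorted((hap1, hap2)))].append(response)
    return {
        pair: (len(vals), mean(vals), stdev(vals) if len(vals) != 1 else 0.0)
        for pair, vals in groups.items()
    }


# Hypothetical records: haplotype numbers and a measured response.
records = [(1, 2, 0.2), (2, 1, 0.4), (1, 1, 0.5)]
for pair, (n, avg, sd) in group_stats(records).items():
    print(pair, n, avg, sd)
```

The same function serves for sub-haplotype-pair groups if the haplotype identifiers passed in are those of the sub-haplotypes.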
The information in FIGURES 18-24 may be used to
determine whether haplotype information for the gene being examined can be used
to predict clinical response to the treatment. One question that can be answered is
whether there is a significant difference in response between groups of individuals
with different haplotype pairs. FIGURES 18-22 show screens of the data that
connect haplotypes with clinical outcomes. The example shown in FIGURE 18 and
the next several screens gives the results of a simulated clinical trial run to test the 
link between patients' haplotypes for CYP2D6 and a phenotypic response called 
"Test". The main layout of this page is the same as described in FIGURE 10. At the 
left side of this view is a list of the clinical measurements performed on the patients. 
This list is completely generic as far as the invention is concerned. Selecting the 
relevant radio button will bring up data for any of the clinical measurements. (Only
one "Test" radio button is shown here, but there may be many, corresponding to
different tests, with appropriate labels.) In this view, the color in a cell of the matrix
indicates the mean value of the measurement for the individuals in that
haplotype-pair group. When one of the cells is selected, text appears at the right, giving the 2
haplotypes, the number of patients in the cell, the mean value and standard deviation
for individuals in the cell. A slide bar is present below the color boxes near the top
of the screen indicating 0% to 100% so that moving, e.g., one or both of the ends of
the bar will change the color scale in the color boxes at the top of the screen as well
as the colors in the matrix. (Note that a slide bar may be used with any screen with
similar colored (or otherwise graded) boxes). FIGURE 19 is a screen showing the
distribution of the patients in each cell of the clinical measurement matrix of
FIGURE 18. In this case, the histograms are collectively normalized so that the user
can directly compare frequencies from one cell to the next. The screen in FIGURE 
20 is brought up when the user selects any of the cells in the haplotype-pair matrix 
in FIGURE 19. This shows the number of patients in the various response bins 
indicated on the horizontal axis. A response bin simply counts the number of 
individuals whose response is within a particular interval. For instance, there are 7 
individuals in the response bin from 0.2 to 0.25 in FIGURE 20. 

The result of the regression calculation shown in FIGURE 21
(which calculation is described below) allows the user to see which polymorphic
sites give the most significant contribution to the differences in phenotype. This
display comes up in a separate window when the user pushes the "Regression"
button on the "Clinical Measurements vs. Haplotype View" (FIGURES 18, 19, or
21). Shown are the results of a dose-response linear regression calculation on each
of the individual polymorphisms (Ref. 4, Chapter 9). In this case, sites 2 and 8 are
most predictive, as indicated by their large values of the significance level. This
fact would lead the user to examine the site 2/8 sub-haplotypes as in FIGURE 22.
This screen gives a detailed view of the mean and standard deviation values for each
of the cells in FIGURE 18. Also shown are the Chi-squared values for the
distributions. These values indicate how close the distributions in each
haplotype-pair group are to normal. The function Q(chi-squared) gives a level of statistical
significance. If Q>0.05 the user could not reject the hypothesis that the distribution
is normal. FIGURE 22 shows that groups having different 2/8 sub-haplotypes can
have very different mean values of the Test phenotype. To see if this
group-to-group variation is significant, the user could ask the DecoGen™ application to
perform an ANOVA (Analysis of Variance) calculation. The results of an ANOVA
calculation are shown in FIGURE 23. Selecting the ANOVA button on any of the
earlier Clinical Measurements views brings up this display. This view uses standard
calculation methods to see if the variation in clinical response between
haplotype-pair groups is statistically significant. The methods used are described in Ref. 4,
Chapter 10. FIGURE 23 shows that the variation between different 2/8 sub-
haplotype groups is statistically significant at the 99% confidence level.

The regression model used in FIGURE 21 starts with a model
of the form

$r = r_0 + S \times d$ (5)

where r is the response, $r_0$ is a constant called the
"intercept", S is the slope and d is the dose. As discussed previously, the most-
common nucleotide at the site and the least common nucleotide are defined. For
each individual in the population, we calculate his "dose" as the number of least-
common nucleotides he has at the site of interest. This value can be 0 (homozygous
for the most-common nucleotide), 1 (heterozygous), or 2 (homozygous for the
least-common nucleotide). An individual's "response" is the value of the clinical
measurement. Standard linear regression methods are then used to fit all of the
individuals' dose and response to a single model. The outputs of the regression
calculation are the intercept $r_0$, the slope S, and the variance (which measures how
well the data fits this simple linear model). The Student's t-test value and the level
of significance can then be calculated. This figure shows the relevant variables
(site, slope S, intercept $r_0$, variance, Student's t-test value and level of significance)
for each of the sites.
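
The dose calculation and the straight-line fit of Equation 5 may be sketched as follows, assuming ordinary least squares as the "standard linear regression method". The function names and the two-letter genotype strings are hypothetical; the three response values are the group averages quoted in the text for site 2, used here only as a consistency check.

```python
def dose(genotype, minor):
    """Copies (0, 1 or 2) of the least-common nucleotide in a two-letter genotype."""
    return genotype.count(minor)


def fit_line(doses, responses):
    """Ordinary least-squares fit of r = r0 + S * d; returns (r0, S)."""
    n = len(doses)
    d_mean = sum(doses) / n
    r_mean = sum(responses) / n
    sxx = sum((d - d_mean) ** 2 for d in doses)
    sxy = sum((d - d_mean) * (r - r_mean) for d, r in zip(doses, responses))
    slope = sxy / sxx
    return r_mean - slope * d_mean, slope


# Genotypes at site 2, with T taken as the least-common nucleotide.
genotypes = ["CC", "CT", "TT"]
responses = [0.231, 0.385, 0.539]
r0, S = fit_line([dose(g, "T") for g in genotypes], responses)
print(r0, S)
```

In practice one dose and one response per individual would be supplied, and the variance and Student's t-test value would be computed from the residuals of the fitted line.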

From the results shown in FIGURE 21, the user would see
that the nucleotides at sites 2 and 8 have significant contributions to the Test
variable. This result would be interpreted as follows. Averaging over all variables
other than the nucleotides at site 2, the Test variable can be predicted by
Test = 0.231 + 0.154 × (number of T's at site 2).
On average, an individual homozygous for C at site 2 will
have a response of 0.231. Heterozygous individuals have an average response of
0.385, and individuals homozygous for T have an average response of 0.539. This
trend is significant at the 99.9% confidence level. It is important to note that the
calculation of significance (the Student's t-test) is based on the assumption that the
responses for individuals (such as seen in FIGURE 20) are normally
distributed. The present invention can incorporate any of the standard methods for
calculating statistical significance for non-normal distributions. Furthermore, the
present invention can include more complex dose-response calculations that
examine multiple sites simultaneously. See, e.g., Ref. 4.

A second method for finding correlations uses predictive
models based on error-minimizing optimization algorithms. One of many possible
optimization algorithms is a genetic algorithm (Ref. 5). Simulated annealing (Ref.
6, Chapter 10), neural networks (Ref. 7, Chapter 18), standard gradient descent
methods (Ref. 6, Chapter 10), or other global or local optimization approaches (see
discussion in Ref. 5) could also be used. As an example (one that is currently
implemented in the DecoGen™ application) a genetic algorithm approach is
described herein. This method searches for optimal parameters or weights in linear
or non-linear models connecting haplotype loci and clinical outcome. One model is
of the form

$C = C_0 + \sum_{a} \Big[ \sum_{i} \big( w_{i,a} R_{i,a} + w'_{i,a} L_{i,a} \big) \Big]$ (6)

where C is the measured clinical outcome, i goes over all
polymorphic sites, a over all candidate genes, $C_0$, $w_{i,a}$ and $w'_{i,a}$ are variable weight
values, $R_{i,a}$ is equal to 1 if site i in gene a in the first haplotype takes on the most
common nucleotide and -1 if it takes on the less common nucleotide. $L_{i,a}$ is the
same as $R_{i,a}$ except for the second haplotype. The constant term $C_0$ and the
weights $w_{i,a}$ and $w'_{i,a}$ are varied by the genetic algorithm during a search process
that minimizes the error between the measured value of C and the value calculated
from Equation 6. Models other than the one given in Equation 6 can be easily
incorporated. The genetic algorithm is especially suited for searching not only over
the space of weights in a particular model but also over the space of possible
models (Ref. 5).
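
The quantity that such a search minimizes may be sketched as follows. The sketch assumes that Equation 6 is a constant plus a weighted sum of the +1/-1 site indicators of the first and second haplotypes, and it is restricted to a single gene for brevity. The function names, the squared-error measure and the example values are assumptions, and the search routine itself (genetic algorithm or otherwise) is not shown.

```python
def encode(haplotype, major):
    """+1 where the haplotype carries the most-common nucleotide, -1 elsewhere."""
    return [1 if nt == m else -1 for nt, m in zip(haplotype, major)]


def predict(c0, w, w_prime, R, L):
    """Single-gene form of the model: C = C0 + sum_i (w_i R_i + w'_i L_i)."""
    return c0 + sum(wi * r + wpi * l for wi, wpi, r, l in zip(w, w_prime, R, L))


def squared_error(params, data):
    """Sum of squared residuals over (R, L, measured C) records."""
    c0, w, w_prime = params
    return sum((predict(c0, w, w_prime, R, L) - c) ** 2 for R, L, c in data)


# Hypothetical two-site gene whose most-common haplotype is "CG".
R = encode("CA", "CG")   # first haplotype  gives [1, -1]
L = encode("TG", "CG")   # second haplotype gives [-1, 1]
params = (0.1, [0.5, 0.0], [0.25, 0.0])
print(predict(*params, R, L), squared_error(params, [(R, L, 0.35)]))
```

An optimizer would repeatedly propose new values of the constant and the weights and keep those that lower the value returned by `squared_error`.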

Correlations can also be analyzed using ANOVA techniques
to determine how much of the variation in the clinical data is explained by different
subsets of the polymorphic sites in the candidate genes. The DecoGen™
application has an ANOVA function that uses standard methods to calculate
significance (Ref. 4, Chapter 10). An example of an interface to this tool is shown
in FIGURE 23.

ANOVA is used to test hypotheses about whether a response
variable is caused by or correlated with one or more traits or variables that can be
measured. These traits or variables are called the independent variables. To carry
out ANOVA, the independent variable(s) are measured and people are placed into
groups or bins based on their values of the variables. In this case, each group
contains those individuals with a given haplotype (or sub-haplotype) pair. The
variation in response within the groups and also the variation between groups is then
measured. If the within-group variation is large (people in a group have a wide
range of responses) and the variation between groups is small (the average
responses for all groups are about the same) then it can be concluded that the
independent variables used for the grouping are not causing or correlated with the
response variable. For instance, if people are grouped by month of birth (which
should have nothing to do with their response to a drug) the ANOVA calculation
should show a low level of significance. Here, as shown in FIGURE 23, each
haplotype-pair group is made up of the individuals in the population who have that
haplotype pair. The table at the bottom shows the number of individuals in the
group, the average response ("Test") of those individuals, and the standard deviation
of that response. At the top is a table showing information comparing the "Between
Group" calculation and the "Within Group" calculations. The details are given in
the reference. [Ref. 4] If the variation (the "Mean Squares" column) is larger for
the "Between Groups" than for the "Within Groups" set, we will have an F-ratio
(="Between Groups" divided by "Within Groups") greater than one. Large values
of the F-ratio indicate that the independent variable is causing or correlated with the
response. The calculated F-ratio is compared with the critical F-distribution value at
whatever level of significance is of interest. If the F-ratio is greater than the critical
F-distribution value, then the user may be confident that the independent variable is
predictive at that level. In this example, the user would see that grouping by
haplotype-pair for sites 2 and 8 for CYP2D6 gives significant probability at the 99%
confidence level. The conclusion from this is that an individual's haplotypes at
these positions in this gene are at least partially responsible for, or are at least strongly
correlated with, the value of Test.
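
The F-ratio described above may be computed as in the following sketch of a standard one-way ANOVA. The function name and the example responses are hypothetical; comparison with the critical F-distribution value at the chosen significance level is left to a statistical table or library.

```python
def f_ratio(groups):
    """One-way ANOVA: between-group mean square / within-group mean square."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means about the grand mean.
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # Variation of individuals about their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical responses for two haplotype-pair groups.
print(f_ratio([[1, 2, 3], [2, 3, 4]]))      # group means close together
print(f_ratio([[1, 2, 3], [11, 12, 13]]))   # group means far apart
```

Groups whose averages are far apart relative to the spread within each group give a large F-ratio, which is the situation the text associates with a predictive independent variable.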

FIGURE 24 shows a screen which is an example interface to 
the modeling tool (i.e., the CTS™ Modeler) described herein. At the right are 
controls to set the parameters for the genetic algorithm (Ref. 5). In the center is a 
graph showing the residual error of the model as a function of the number of genetic 
algorithm generations. At the bottom is a bar graph showing the current best 
weights for Eq. 6. In this example, the linear model described in Eq. 4 is used to
find optimal weights for the polymorphic sites. The final parameters arrived at are
$C_0 = 0.1$, $w_{2,CYP2D6} = 0.15$ and $w_{8,CYP2D6} = 0.1$. This says that the response
variable "Test" can be predicted from the formula:

Test = 0.1 + [0.15 × (Number of Cs in position 2) + 0.1 × (Number of As in position
8)] × 2 where "number" refers to the number in the two haplotypes for an individual.

10. Preferably, follow-up trials are designed to test and
validate the haplotype-response mathematical model.

The outcome of Step 9 is a hypothesis that people with
certain haplotype pairs or genotypes are more likely or less likely on average to 
respond to a treatment. This model is preferably tested directly by running one or 
more additional trials to see if this hypothesis holds. 

11. A diagnostic method is designed (using one or more 
of haplotyping, genotyping, physical exam, serum test, etc.) to determine those 
individuals who will or will not respond to the treatment. 

The final outcome of the CTS™ method is a diagnostic
method to indicate whether a patient will or will not respond to a particular
treatment. This diagnostic method can take one of several forms, e.g., a direct
DNA test, a serological test, or a physical exam measurement. The only
requirement is that there is a good correlation between the diagnostic test results and
the underlying haplotypes or sub-haplotypes that are in turn correlated with clinical
outcome. In the preferred embodiment, this uses the predictive genotyping method
described in item 6.

2. Illustration With ADRB2 Gene 

Figure 26 is the opening screen for the Asthma project. This
screen appears after the "Asthma" folder has been selected from among the projects
shown at the left. Selecting a folder causes the genes associated with that project to
become active. Genes known or suspected of being involved in asthma are shown
in the screen in "Extracellular" and "Intracellular" compartments. The text "Active
Gene: DAXX" is a default value; "DAXX" will be replaced with the name of
whatever gene is selected from this window. Selecting ADRB2, and then
"Geneinfo" from the menu at left, brings up Figure 27.

Figure 27 presents data and statistics related to the ADRB2
gene. Selecting "GeneStructure" from the menu at left brings up Fig. 28A.

Figure 28A is a screen showing the genomic structure of the
ADRB2 gene (showing the location of features of the gene, such as promoters,
exons, introns, 5' and 3' untranslated regions), polymorphism and haplotype
information, and the number of times each haplotype was seen in the representatives
of each of 4 world population groups. The column "Wild" contains the number of
individuals homozygous for the more common nucleotide at each polymorphic site,
"Mut" contains the number homozygous for the less common nucleotide, and "Het"
is the number of heterozygous individuals. Overlaid on the two graphical gene
representations at the upper part of the screen are vertical bars, indicating the
positions of the polymorphic sites elaborated in the middle box. The user may
scroll through the lower boxes to bring different portions of the polymorphism and
haplotype data into view. Selecting row 6 in the middle window results in Figure
28B.

Figure 28B is a screen where a particular polymorphic site
has been selected in the middle box. The upper graphical representation of the gene
has been replaced by a textual representation, presented as a nucleotide sequence
aligned with the lower graphical representation at the point of the selected
polymorphic site (indicated by the black triangles). At the polymorphic site, the two
observed nucleotides (T and C) are displayed. Selecting "Patient table" from the
menu at left brings up Fig. 29A.

Figure 29A presents genealogical information and diplotype
and haplotype data for individuals within the database. Shaded rectangles within
the table represent missing data. Within the rectangles and ovals are the ID
numbers of the individuals; below each of these in the upper genealogical chart are
the two haplotypes of the ADRB2 gene present in that individual, identified by
number. The nucleotides comprising these haplotypes are displayed in the box at
the lower right. Selecting "Clinical Trial Data" from the menu at left brings up Fig.
29B.

Figure 29B presents the clinical data sorted by individual
patient. Severity scores, Skin Test results, and the clinically measured parameters
described elsewhere are set out in columns. "NP" stands for "No data Point", and
represents data missing for any reason. Selecting "HAPSNP" from the menu at left
brings up Fig. 30.

Figure 30 presents, for each patient, a row of color-coded (or
shaded) squares representing the heterozygosity of the patient at each polymorphic
site. These are adjacent to a row of split squares, where the same information is
presented in a two-color (or shaded) format. Selecting the HAPPair command from
the menu at the left brings up Fig. 31.

Figure 31 presents the "HAP Pair Frequency View" in which
the world population distribution of haplotype or sub-haplotype pairs can be
investigated. In this window, polymorphic sites 3, 9, and 11 have been selected by
checking the corresponding boxes above the haplotypes. Each cell in the matrix
below corresponds to a haplotype pair identified by the HAP numbers on the x and
y axes. The height of the color-coded (or shaded) bars within each cell corresponds
to the number of individuals of each population group having that haplotype pair.
Clicking on the V/D button at the top of the screen toggles between Fig. 31 and 32.

Figure 32 shows the same data in tabular form. In this figure
all SNPs have been selected, so the haplotypes being evaluated consist of thirteen
polymorphic sites. Each row in the table corresponds to a haplotype pair (the two
haplotypes which comprise the pair are identified in the first two columns),
followed by the number of individuals in the database having that pair, and the
percentage of the total population this number represents. Under each population
group are three columns presenting the number of individuals in the population group
with that pair, the percentage of the population group that has that pair, and the
percentage predicted by Hardy-Weinberg equilibrium. Selecting "Linkage" from
the menu at left brings up Fig. 33.
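
The Hardy-Weinberg prediction for a haplotype pair follows from the individual haplotype frequencies, as in the sketch below. The function name and the example frequencies are hypothetical; the formula is the standard one, p² for a pair of identical haplotypes and 2pq for a pair of different haplotypes.

```python
def hw_expected(freq, hap_a, hap_b):
    """Expected frequency of a haplotype pair under Hardy-Weinberg equilibrium."""
    p, q = freq[hap_a], freq[hap_b]
    return p * p if hap_a == hap_b else 2 * p * q


# Hypothetical haplotype frequencies in one population group.
freq = {1: 0.6, 2: 0.4}
for pair in [(1, 1), (1, 2), (2, 2)]:
    print(pair, hw_expected(freq, *pair))
```

Multiplying the returned value by 100 gives the percentage that would appear in the third column for each population group.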

Figure 33 displays separate matrices for the total population 
and for each population group. Each cell is color-coded (or shaded) to indicate the 
extent to which the two haplotypes occur together in individuals, i.e., the degree to 
which they are linked. Selecting "HAPTyping" from the menu at left brings up the 
screen in Fig. 34. 

Figure 34 presents the ambiguity scores that result from
masking one or more SNPs or polymorphisms in the genotype. The ambiguity
scores are calculated by taking the sum of the geometric means of all pairs of
genotypes rendered ambiguous by the mask, and multiplying by ten. All population
groups have been chosen for inclusion in this figure by checking off the boxes at the
upper left of the screen. The list of haplotype pairs has been sorted by the
calculated Hardy-Weinberg frequency, and the pairs have been numbered
consecutively, as shown in the first column.
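
One reading of the ambiguity score calculation is sketched below. It assumes that the geometric mean is taken over the frequencies of the two haplotype pairs whose genotypes become indistinguishable under the mask; that interpretation, the function names, and the example haplotypes and frequencies are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def genotype(hap1, hap2, mask):
    """Unordered nucleotide pair at each site the mask keeps (mask[i] is True)."""
    return tuple(
        tuple(sorted(site)) for site, keep in zip(zip(hap1, hap2), mask) if keep
    )


def ambiguity_score(pairs, mask):
    """pairs: list of (hap1, hap2, frequency).

    Sums sqrt(f_a * f_b) over every two haplotype pairs that share a masked
    genotype, then multiplies by ten.
    """
    total = 0.0
    for (a1, a2, fa), (b1, b2, fb) in combinations(pairs, 2):
        if genotype(a1, a2, mask) == genotype(b1, b2, mask):
            total += sqrt(fa * fb)
    return 10 * total


# Hypothetical two-site haplotype pairs with assumed frequencies.
pairs = [("CA", "CA", 0.04), ("CA", "CG", 0.01), ("TA", "TA", 0.5)]
print(ambiguity_score(pairs, [True, True]))    # both sites measured
print(ambiguity_score(pairs, [True, False]))   # second site ignored
```

Under this reading, a mask that confuses only rare haplotype pairs yields a small score, while a mask that confuses common pairs yields a large one.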

A mask that causes SNP 8 to be ignored in all cases has been
imposed by deselecting the appropriate box in the "Choose SNP" row above the
haplotype list. Additional masking has been imposed by deselecting the appropriate
boxes in the mask to the right of the Genotype table. (The mask is to the right of the
table and may be accessed by scrolling horizontally; in the figure it has been
relocated to bring it into view.) In the first mask, only SNP 8 is ignored, which results
in haplotype pairs 4 and 73 both being consistent with the genotype observed. (In
other words, the genotypes derived from haplotype pairs 4 and 73 differ only at SNP
8, and cannot be distinguished if it is not measured.) An ambiguity score of 0.016
is associated with this first mask. The frequency of haplotype pair 4 is much greater
than that of haplotype pair 73 (recall that the list is sorted by frequency), so one
could resolve this ambiguity with some confidence simply by choosing haplotype
pair 4. (In an alternative embodiment, the probability of each choice being the
correct one could be displayed.) For the present application, in general, the mask
with the largest number of ignored SNPs that retains an ambiguity score of about
1.0 or less will be preferred. The ambiguity score cut-off that is chosen may vary
depending on the intended use of the inferred haplotypes. For example, if haplotype
pair information is to be used in prescribing a drug, and certain haplotype pairs are
associated with severe side effects, the acceptable ambiguity score may be reduced.
In such a situation masks that do not render the haplotype pairs of interest
ambiguous would be preferred as well. Selecting "Phylogenetic" from the menu at
left brings up Fig. 35.

Figure 35 presents haplotype data in a phylogenetic minimal
spanning network. Each disk corresponds to a haplotype; the haplotype number is
to the immediate right of each disk. The size of each disk is proportional to the
number of individuals having that haplotype; that number is displayed in
parentheses to the right of each disk. Haplotypes that are closely related, that is, they
differ at only one polymorphic site, are connected by solid lines. Haplotypes that
differ at two sites are connected by light lines, and are spaced farther apart. The
colored (or shaded) wedges represent the fraction of individuals having that
haplotype that are from different population groups. Selecting "Clinical Haplotype
Correlation" brings up the screen in Fig. 36.

Figure 36 presents the association between a clinical outcome
value (in this case, "delta %FEV1 pred", which is the change in FEV1 observed after
administration of albuterol, corrected for size, age, and gender) and the haplotype
pairs. The SNPs one
wishes to test for association may be selected by checking off the appropriate box
above the HAP list table. The value of delta %FEV1 is represented in grayscale or
by a color scale. Each cell in the matrix corresponds to a given haplotype pair,
defined by the haplotype numbers on the x and y axes. The number in each cell is
the number of patients having that haplotype pair, and the color (or shading) of each
cell reflects the response of those patients to albuterol. In this case, groups of
people with haplotype pairs shown in the red (or darkly shaded) boxes have the
highest average response, e.g. haplotype pairs 3,4 and 3,5. (See also Fig. 41, which
presents numerical results showing that individuals with these haplotype pairs have
a high average response to albuterol.) Under the "Clinical Mode" menu heading at
the top of the screen is a command that the user may use to toggle among Figs. 36,
37, 38, and 40.

Switching to Fig. 37 in this manner displays a collection of 
histograms, one in each cell of a haplotype pair matrix. Selecting the 1,1 cell 
enlarges it, bringing up Fig. 38.

Figure 38 is a histogram showing the number of individuals 
having the 1,1 haplotype pair who exhibited the response to albuterol shown on the 
X axis. The bars in the histogram are color-coded (or shaded) as well, as an 
additional indication of the degree of response. 

In either Fig. 36 or Fig. 37, there is a button with an icon of a
small scatter plot (just below the Help menu at the top of the screen). Selecting this
button brings up Fig. 39A. This figure displays the regression calculations
employed in the multi-SNP analysis, or "Build-up" process. Given the confidence
values shown, which are the default values for the "tight cutoff" and "loose cutoff",
the program generates pairwise combinations of SNPs, tests their p-values for
correlation with "delta %FEV1 pred" against the cutoff values, and, from those
subhaplotypes that pass the cut-offs, re-calculates and tests new pairwise
combinations, until the number of SNPs in the subhaplotypes reaches the limit
shown in the "Fixed Site" box. In the example shown, no four-SNP subhaplotype
passed the loose cutoff, thus there are only 1-, 2-, and 3-SNP sub-haplotypes shown
in this screen. New values may be entered in the Confidence and Fixed Site fields;
clicking on the calculator button (under the File menu) re-executes the Build-up and
Build-down processes with the entered values.

A reverse SNP analysis, or "Build-down" process, may also
be carried out; the presence of the minus sign in the "Fixed Site" box indicates that
this process is being requested. (In the example given, only a single "Build-down"
round was executed, so as to ensure that the full haplotype is present for
comparison.)

For each "marker" (SNP, subhaplotype, or haplotype) in the
left column, a regression analysis of the correlation of the number of copies of that
marker with the value of "delta %FEV1 pred" is generated, and selected statistical
information is presented in the columns to the right. (A negative correlation
coefficient (R) indicates that response to albuterol decreases with increasing copy
number of the indicated marker.) The SNPs or subhaplotypes exhibiting the lowest
p values are identified as the ones that should most preferably be measured in
patients in order to predict response to albuterol. Selecting the box to the left of the
**A*****A*G** sub-haplotype brings up Fig. 39B.

Figure 39B presents in graphic form the calculation of the
regression parameters displayed in Fig. 39A. The values of "delta %FEV1 pred" for
patients with 0, 1, and 2 copies of the **A*****A*G** subhaplotype are plotted
vertically at three abscissas. A line is drawn through the three means, and the slope
of the line is taken as an indication of the degree of correlation. The intercept,
slope, slope range, R and R² values, and the p value associated with this line are all
listed in Fig. 39A. The "slope range" is a pair of limits, reflecting the standard
deviation in the values of "delta %FEV1 pred". Mathematically, the p value listed
in Fig. 39A is the probability of obtaining a slope at least this far from zero if the
true slope were zero, i.e., if there were in fact no correlation. A lower value of p
thus indicates greater reliability.
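
The regression behind Fig. 39B can be sketched as follows. This is a minimal illustration, not the DecoGen implementation: the function names, the wildcard-matching rule ('*' matches any base), and the cohort values are all assumptions made for the example, and the fit is an ordinary least-squares fit over individual patients rather than over the three group means. The p value is omitted because it requires a t-distribution routine.

```python
from math import sqrt

def copies(pattern, haplotype_pair):
    """Count how many of the two haplotypes match a wildcard pattern ('*' = any base)."""
    return sum(
        all(p in ("*", base) for p, base in zip(pattern, hap))
        for hap in haplotype_pair
    )

def linear_regression(x, y):
    """Ordinary least-squares fit y = intercept + slope * x; also returns Pearson's R."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical cohort: marker copy number (0, 1, or 2) and response per patient.
copy_number = [0, 0, 1, 1, 2, 2]
response = [4.0, 6.0, 9.0, 11.0, 14.0, 16.0]
slope, intercept, r = linear_regression(copy_number, response)
```

A positive slope here would correspond to response increasing with copy number; a negative R would indicate the opposite, as noted for Fig. 39A.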

Fig. 40 (reached through the "Clinical Mode" menu) displays
the observed haplotype pairs, their distribution in the population, and the mean
clinical response (delta %FEV1 pred.) of the patients having those haplotype pairs.
Selecting the "normal" button (to the right of the scatter plot button) brings up Fig.
41.

Figure 41 shows a screen that displays the results of an 
ANOVA calculation in which patients were grouped according to haplotype pairs, 
and the average value of "delta %FEV1 pred." was analyzed both within the groups 
and between the groups. This permits one to determine which pairs of haplotypes 
are associated with the observed clinical response. All SNPs in the ADBR2 gene 
have been selected in the row of boxes labeled "Choose SNPs", thus the groups are 
the same as the cells in the matrix in Fig. 36. Groups containing one patient were
ignored, leaving the seven groups listed at the bottom of the screen. This left six
degrees of freedom (the parameter "DF") for inter-group comparisons. The
variation ("Mean Squares") is larger between groups than within groups, and the
ratio of the two (F-ratio) is greater than one. (A large F-ratio indicates that the
independent variable - the haplotype pair group - is correlated with the response.)
There is a significant difference (p = 0.027) between the mean square value of the
clinical response between groups compared to that within groups. It is found in this
example that being homozygous for haplotype 3 results in a significantly lower
response (average 8.5%), while individuals with haplotype pair 3,4 (i.e.,
GCACCTTTACGCC and GCGCCTTTGCACA) show a good response to albuterol
(average delta %FEV1 pred = 19.25%). This information is displayed in a more
visual presentation in Fig. 36. 
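
The F-ratio described for Fig. 41 can be illustrated with a short one-way ANOVA sketch. The function name and the sample values are hypothetical; single-patient groups are dropped, as in the text, so k remaining groups leave k - 1 degrees of freedom between groups. The p value is not computed here, since it requires the F distribution.

```python
def mean(values):
    return sum(values) / len(values)

def one_way_anova(groups):
    """Return (df_between, df_within, F-ratio) for a list of response-value groups."""
    groups = [g for g in groups if len(g) > 1]  # ignore groups with a single patient
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    ms_between = ss_between / df_between  # "Mean Squares" between groups
    ms_within = ss_within / df_within     # "Mean Squares" within groups
    return df_between, df_within, ms_between / ms_within

# Hypothetical responses grouped by haplotype pair; the last group has one patient.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [100.0]]
df_between, df_within, f_ratio = one_way_anova(groups)
```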

Figure 42 is arrived at by selecting the "ClinicalVariables" 
command from the menu to the left of most of the previous screens. This is the 
same information displayed in Fig. 38, except that it is for the entire cohort rather 
than for a selected haplotype pair. The number of patients is plotted against the 
value of "delta %FEV1 pred". Note the outliers at 50% and 65% response. 
Selecting "ClinicalCorrelations" from the menu to the left brings up Fig. 43.

Figure 43 is a plot of each patient's "FEV1% PRE" (the
normalized value of FEV1 prior to administration of albuterol) against "delta
%FEV1 pred". These variables are selected in the upper part of the screen. It is
seen in this example that the response does not correlate with the initial value of
FEV1.

IMPROVED METHODS

1. Improved Method For Finding
Optimal Genotyping Sites



This aspect of the invention provides a method for
determining an individual person's haplotypes for any gene with reduced cost and
effort. A haplotype is the specific form of the gene that the individual inherited
from either mother or father. The 2 copies of the gene (one maternal and one
paternal) usually differ at a few positions in the DNA locus of the gene. These
positions are called polymorphisms or Single Nucleotide Polymorphisms (SNPs).
The minimal information required to specify the haplotype is the reference
sequence, the set of sites where differences occur among people in a population,
and the nucleotides at those sites for a given copy of the gene possessed by the
individual. For the rest of this discussion, we assume that the reference sequence is
given, and we represent the haplotype as a string of letters specifying the
nucleotides at the variable sites. In almost all cases, only two of the possible 4
nucleotides will occur at any position (e.g., A or T, C or G), so for generality we can
represent the two values for alleles as 1 and 0. Therefore a haplotype can be
represented as a string of 1s and 0s such as 001010100. In practicing this invention,
one may make use of known methods for discovering a representative set of the
haplotypes that exist in a population, as well as their frequencies. One begins by
sequencing large sections of the gene locus in a representative set of members in the
population. This provides (1) a determination of all of the sites of variation, and (2)
the mixed (unphased) genotype for each individual at each site. For instance in a
sample of 4 individuals for a gene with 3 variable sites, the mixed genotypes could
be:



Individual   Genotype   Genotype   Genotype   Haplotype of   Haplotype of
             site 1     site 2     site 3     1st allele     2nd allele

1            1/1        1/0        1/0        3              4
2            0/0        0/0        0/0        1              1
3            1/0        1/0        0/0        1              2
4            1/1        0/0        1/0        3              5

This mixed set of genotypes could be derived from the
following haplotypes:

Haplotype   Haplotype   Frequency in
No.                     population

1           000         3
2           110         1
3           100         2
4           111         1
5           101         1



A method for deriving the haplotypes from the genotypes is 
described in a separate patent filing. 
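
The two tables above are consistent with one another, which can be checked with a short sketch. It does not derive haplotypes from genotypes (that method is described elsewhere); it only goes in the easy direction, from the phased haplotype pairs to the unphased genotypes and the haplotype counts. The names `HAPLOTYPES`, `PAIRS`, `genotype`, and `haplotype_counts` are chosen for illustration.

```python
from collections import Counter

# Haplotype numbers and strings, and each individual's haplotype pair, from the tables.
HAPLOTYPES = {1: "000", 2: "110", 3: "100", 4: "111", 5: "101"}
PAIRS = {1: (3, 4), 2: (1, 1), 3: (1, 2), 4: (3, 5)}

def genotype(hap_a, hap_b):
    """Unphased genotype: one 'x/y' entry per site, larger allele written first."""
    return tuple("/".join(sorted((a, b), reverse=True)) for a, b in zip(hap_a, hap_b))

def haplotype_counts(pairs):
    """Number of times each haplotype occurs across all individuals (two per person)."""
    return Counter(h for pair in pairs.values() for h in pair)

genotypes = {
    person: genotype(HAPLOTYPES[a], HAPLOTYPES[b]) for person, (a, b) in PAIRS.items()
}
counts = haplotype_counts(PAIRS)
```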

The haplotypes are a fundamental unit of human evolution 
and their relationships can be described in terms of phylogenetics. One 
consequence of this phylogenetic relationship is the property of linkage 
disequilibrium. Basically this means that if one measures a nucleotide at one site in 
a haplotype, one can often predict the nucleotide that will exist at another site
without having to measure it. This predictability is the basis of this aspect of the
invention. Elimination of sites that do not need to be measured results in a reduced 
set of sites to be measured. 

Information from a previously measured set of individuals 
(who were measured at all sites) may be used to determine the minimum number (or
a reduced number) of sites that need to be measured in a new individual in order to 
predict the new individual's haplotypes with a desired level of confidence. Since 
the measurement at each site is expensive, the invention can lead to great cost 
reduction in the haplotyping process. 

Step 1: Measure the full genotypes of a representative cohort
of individuals.

Step 2: Determine their haplotypes directly, or indirectly
(e.g., using one of several algorithms).

Step 3: Tabulate the frequencies for each of these haplotypes.

Note that Steps 1-3 are optional. The remaining steps only
require that a database of haplotypes with frequencies exists. There are several
ways to achieve this, but the above set of steps is the preferred route.

Step 4: Construct the list of all full genotypes that could come
from the observed haplotypes. Note that only a subset of these will actually be
observed in a typical sample, for example 100-200 individuals.

Step 5: Predict the frequency of these genotypes from the
Hardy-Weinberg equilibrium. If two haplotypes Hap1 and Hap2 have frequencies
f1 and f2, the expected frequency of the mix is 2 x f1 x f2, or f1 x f2 if Hap1 and
Hap2 are identical.
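
Steps 4 and 5 can be sketched as a single pass over all unordered haplotype pairs. The function name and the example frequencies are hypothetical; the only rule used is the Hardy-Weinberg expectation stated above (2 x f1 x f2 for two different haplotypes, f1 x f1 for a homozygous pair).

```python
from itertools import combinations_with_replacement

def expected_pair_frequencies(hap_freqs):
    """Map each unordered haplotype pair to its Hardy-Weinberg expected frequency."""
    expected = {}
    for a, b in combinations_with_replacement(sorted(hap_freqs), 2):
        fa, fb = hap_freqs[a], hap_freqs[b]
        expected[(a, b)] = fa * fb if a == b else 2 * fa * fb
    return expected

# Hypothetical haplotype frequencies that sum to 1.
hap_freqs = {"A": 0.5, "B": 0.3, "C": 0.2}
expected = expected_pair_frequencies(hap_freqs)
```

Because the haplotype frequencies sum to 1, the expected pair frequencies also sum to 1, which is a useful sanity check on any tabulated database.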

Step 6: Go through this list and find all sites that, if they were
not measured, would still allow one to correctly determine each pair of haplotypes.
For example, take the case where the three haplotypes A (1111), B (1110), and C
(0000) exist in a population. The six genotypes that could be observed are derived
from the six different pairs that are possible:

     Hap       Polymorphic Site
     Pair      1     2     3     4

1.   A,A      1/1   1/1   1/1   1/1
2.   A,B      1/1   1/1   1/1   1/0
3.   A,C      1/0   1/0   1/0   1/0
4.   B,B      1/1   1/1   1/1   0/0
5.   B,C      1/0   1/0   1/0   0/0
6.   C,C      0/0   0/0   0/0   0/0

Not measuring any one of the sites 1-3 would still permit one
to correctly assign a haplotype pair to an individual. From this we can see that any 
one of the first three positions, together with the fourth, carries all of the information 
required to determine which pair of haplotypes an individual has. 

Step 7: Extend the analysis of Step 6 as follows. Create a set
of masks of the same length as the haplotype. A mask may be represented by a
series of letters, e.g., Y for yes and N for no, to indicate whether the marked site is
to be measured. For example, using the mask YNNY in the previous example, one
would measure only sites 1 and 4, and one could use the information that only
haplotypes 1111, 1110, and 0000 exist to infer the haplotypes for the individuals.
Masks NYNY and NNYY would give equivalent information. If there are n sites,
all combinations of Y and N produce 2^n masks, of which 2^n - 1 need to be examined
(the all-N mask provides no information).
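
The mask enumeration can be sketched for the three-haplotype example above. The helper names are illustrative, and a genotype is represented here as a tuple of two-character strings ("11", "10", "00") rather than in the "1/0" notation of the table. A mask is treated as fully resolving when every haplotype pair gives a distinct masked genotype.

```python
from itertools import combinations_with_replacement, product

HAPS = {"A": "1111", "B": "1110", "C": "0000"}

def genotype(h1, h2):
    """Unphased genotype as a tuple of per-site strings, e.g. ('11', '10', ...)."""
    return tuple("".join(sorted(a + b, reverse=True)) for a, b in zip(h1, h2))

def all_masks(n):
    """All 2**n - 1 masks of length n, excluding the uninformative all-N mask."""
    return ["".join(m) for m in product("YN", repeat=n) if "Y" in m]

def apply_mask(geno, mask):
    """Keep only the sites marked Y."""
    return tuple(g for g, m in zip(geno, mask) if m == "Y")

def resolves_all(mask, haps):
    """True if every haplotype pair yields a distinct masked genotype."""
    seen = set()
    for a, b in combinations_with_replacement(sorted(haps), 2):
        key = apply_mask(genotype(haps[a], haps[b]), mask)
        if key in seen:
            return False
        seen.add(key)
    return True

masks = all_masks(4)
one_site = [m for m in masks if m.count("Y") == 1 and resolves_all(m, HAPS)]
two_site = [m for m in masks if m.count("Y") == 2 and resolves_all(m, HAPS)]
```

For this example no single-site mask resolves all six pairs, and the two-site masks that do are exactly the ones that pair site 4 with one of sites 1-3.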

Step 8: For each mask, evaluate how much ambiguity exists
from this measurement of incomplete information. For example, one measure of
ambiguity would be to take all pairs of genotypes that are identical when using the
mask, and multiply their frequencies. The product may be converted to the
geometric mean. Then, for each mask, add up all such products for all ambiguous
pairs to obtain an ambiguity score, which is used as a penalty factor in evaluating
the value of the mask. The consequence of this would be to highly penalize masks
that fail to resolve likely-to-be-seen genotypes into correct haplotypes, and masks
that leave large numbers of genotypes ambiguous, such as the mask NNNY in the
above example. This would give greater weight to masks that only confuse low
frequency, low probability genotypes. A variety of other scoring schemes could be
devised for this purpose.

This approach is most preferably implemented by means of a
computer program that allows a user to view the ambiguity score for each mask, and
calculate the tradeoff between reduced cost and reduced certainty in the
determination of the haplotypes.
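
One possible reading of the ambiguity score is sketched below. It is only one of the scoring schemes the text allows, and it rests on stated assumptions: the items compared are haplotype pairs with their Hardy-Weinberg frequencies, two items are "ambiguous" when their masked genotypes coincide, and the haplotype frequencies (0.5, 0.3, 0.2) are invented for the example. The `geometric_mean` flag switches between summing raw products and summing their square roots.

```python
from itertools import combinations, combinations_with_replacement
from math import sqrt

HAPS = {"A": "1111", "B": "1110", "C": "0000"}
FREQS = {"A": 0.5, "B": 0.3, "C": 0.2}  # hypothetical haplotype frequencies

def masked_genotype(h1, h2, mask):
    """Unphased genotype of a haplotype pair, restricted to the sites marked Y."""
    return tuple(
        "".join(sorted(a + b, reverse=True))
        for a, b, m in zip(h1, h2, mask)
        if m == "Y"
    )

def ambiguity_score(mask, haps, freqs, geometric_mean=False):
    """Sum of frequency products over all haplotype pairs the mask cannot tell apart."""
    items = []
    for a, b in combinations_with_replacement(sorted(haps), 2):
        freq = freqs[a] * freqs[b] * (1 if a == b else 2)  # Hardy-Weinberg
        items.append((masked_genotype(haps[a], haps[b], mask), freq))
    score = 0.0
    for (g1, f1), (g2, f2) in combinations(items, 2):
        if g1 == g2:
            score += sqrt(f1 * f2) if geometric_mean else f1 * f2
    return score

scores = {mask: ambiguity_score(mask, HAPS, FREQS) for mask in ("YNNY", "NNNY", "YYYN")}
```

Under these assumptions the mask YNNY scores zero, while NNNY is penalized because it confuses several pairs, including the relatively common A,B and A,C pairs.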
Step 9: Genotype new individuals using the optimal set of m
sites (the optimal mask). In the example above, there are three equivalent optimal
masks, YNNY, NYNY and NNYY, which require that only two of the four
polymorphic sites be measured. (These masks have zero ambiguity.)

Step 10: Derive these individuals' full n-site haplotypes by
matching their m-site genotypes to the appropriate m-site genotypes derived from
the n-site haplotypes of the initial cohort. If there is an ambiguity in the choice, the
more common haplotype may be chosen, but preferably a haplotype pair will be
chosen based on a weighted probability method as follows:

If two haplotype pairs A and B exist that could explain a
given genotype, the Hardy-Weinberg equilibrium will predict probabilities pA and
pB, where pA + pB = 1. One chooses a random number between 0 and 1. If the
number is less than or equal to pA, the first haplotype pair A is assumed. If the
number is greater than pA, the second pair B is assumed. There are more complex
variants of this algorithm, but this simple, unbiased approach is preferred.
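
The weighted probability method can be sketched as a cumulative draw. The function name and the candidate frequencies are hypothetical; the raw Hardy-Weinberg frequencies of the candidate pairs are normalized so that they sum to 1, and the sketch generalizes from two candidates to any number, which is an extension beyond what the text states.

```python
import random

def choose_pair(candidates, rng=random):
    """Pick one haplotype pair with probability proportional to its frequency.

    candidates: list of (haplotype_pair, hardy_weinberg_frequency) tuples.
    rng: any object with a random() method returning a float in [0, 1).
    """
    total = sum(freq for _, freq in candidates)
    draw = rng.random()
    cumulative = 0.0
    for pair, freq in candidates:
        cumulative += freq / total  # normalized so the probabilities sum to 1
        if draw <= cumulative:
            return pair
    return candidates[-1][0]  # guard against floating-point round-off

# Hypothetical case: two pairs explain the same genotype, with frequencies 0.3 and 0.1,
# so pA = 0.75 and pB = 0.25 after normalization.
candidates = [(("H1", "H4"), 0.3), (("H2", "H3"), 0.1)]
chosen = choose_pair(candidates)
```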

2. Improved Methods For Correlating Haplotypes
With Clinical Outcome Variable(s)

The following methods are described for correlating
haplotypes, or haplotype pairs, with a clinical outcome variable. However, these
methods are applicable to correlating haplotypes, and/or haplotype pairs, to any
phenotype of interest, and are not limited to a clinical population or to applications in
a clinical setting.

a. Multi-SNP Analysis Method (Build-Up Process)

This process is outlined in the flow chart shown in Figure 45.
The first step (S1) is the collection of haplotype information and clinical data from a
cohort of subjects. Clinical data may be acquired before, during, or after collection
of the haplotype information. The clinical data may be the diagnosis of a disease
state, a response to an administered drug, a side-effect of an administered drug, or
other manifestation of a phenotype of interest for which the practitioner desires to
determine correlated haplotypes. The data is referred to as "clinical outcome
values." These values may be binary (e.g., response/no response, survival at 5
months, toxicity/no toxicity, etc.) or may be continuous (e.g., liver enzyme levels,
serum concentrations, drug half-life, etc.).

The collection of haplotype information is the determination
(e.g., by direct sequencing or by statistical inference) of a pattern of SNPs for each
allele of a pre-selected gene or group of genes, for each individual in the cohort.
The gene or group of genes selected may be chosen based on any criteria the
practitioner desires to employ. For example, if the haplotype data is being collected
in order to build a general-purpose haplotype database, a large number of clinically
and pharmacologically relevant genes are likely to be selected. Where a
retrospective analysis of a cohort from an ongoing or completed clinical study is
being carried out, a smaller number of genes judged to be relevant might be
selected.

The next step (S2) is the finding of single SNP correlations.
Each individual SNP is statistically analyzed for the degree to which it correlates
with the phenotype of interest. The analysis may be any of several types, such as a
regression analysis (correlating the number of occurrences of the SNP in the
subject's genome, i.e., 0, 1, or 2, with the value of the clinical measurement),
ANOVA analysis (correlating a continuous clinical outcome value with the presence
of the SNP, relative to the outcome value of individuals lacking the SNP), or
case-control chi-square analysis (correlating a binary clinical outcome value with the
presence of the SNP, relative to the outcome value of individuals lacking the SNP).

In one embodiment, a "tight cut-off" criterion is next applied
to each SNP in turn. A first SNP is selected (S3) and its correlation with the clinical
outcome is tested against a tight cut-off (S4). A typical value for the tight cut-off
will be in the range p = .01 to .05, although other values may be chosen on empirical
or theoretical grounds. If the SNP correlation meets the tight cut-off it is displayed
to the user of the system (S5) (or, alternatively, stored for later display), and stored
for later combination (S6). If the SNP correlation does not meet the tight cut-off it
is tested against a "loose cut-off" (S7), typically in the range p = .05 to 0.1. Again,
other cut-off values may be chosen if desired for any reason. (User-selected tight
and loose cut-off values are entered in the two boxes labeled "confidence" in Fig.
39A.) A SNP whose correlation meets the loose cut-off is stored for later
combination (S6). Any SNP whose correlation does not meet either cut-off is
discarded (S8), i.e., it is not considered further in the process. If there are SNPs
remaining to be tested against the cut-offs (S9) they are selected (S10) and tested
(S4) in turn.
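
The tight and loose cut-off screen can be sketched as a simple two-threshold filter. The function name, the marker names, and the p values are hypothetical, and the sketch assumes that "meeting" a cut-off means having a p value less than or equal to it.

```python
def screen(p_values, tight=0.05, loose=0.1):
    """Split markers into those displayed (pass tight) and those saved (pass either).

    p_values: dict mapping marker name to the p value of its correlation.
    Markers passing neither cut-off are discarded, i.e. left out of both lists.
    """
    displayed, saved = [], []
    for marker, p in p_values.items():
        if p <= tight:
            displayed.append(marker)  # shown to the user
            saved.append(marker)      # and kept for later combination
        elif p <= loose:
            saved.append(marker)      # kept for later combination only
    return displayed, saved

# Hypothetical single-SNP results.
p_values = {"SNP1": 0.01, "SNP2": 0.08, "SNP3": 0.40}
displayed, saved = screen(p_values)
```

The same filter applies unchanged to the sub-haplotypes generated in later rounds, since they are tested against the same two cut-offs.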

In an alternative embodiment, a tight cut-off is not applied, 
and each SNP's correlation is tested directly against the loose cut-off, and the SNP 
is either saved or discarded. In this embodiment, correlations of pair-wise generated 
sub-haplotypes (see below) are also tested directly against the loose cut-off. If 
desired, SNPs and sub-haplotypes which are saved at the end of this alternative 
process may be measured against a tight cut-off, and those that pass may be 
displayed. 

When all SNPs have had their correlations tested, the next
step of the process consists of generating all possible pair-wise combinations
(sub-haplotypes) of the saved SNPs. If novel (i.e., untested) sub-haplotypes are possible
(S11), which will be the case on the first iteration, they are generated by pair-wise
combination of all saved SNPs (S12). The correlations of the newly generated
sub-haplotypes with the clinical outcome values are calculated (S13), as was done for
the SNPs. A first sub-haplotype is selected (S15) and its correlation is tested against
the tight and loose cut-offs (S4, S7) as described above for the SNP correlations.
Each sub-haplotype is tested in turn, as described above, discarding any
sub-haplotypes that do not pass the cut-off criteria and saving those that do pass.

When all sub-haplotypes have been examined, the process
generates new pair-wise combinations among the originally saved SNPs and the
newly saved sub-haplotypes, and among all saved sub-haplotypes as well. The
process may be iterated until no new combinations are being generated;
alternatively the practitioner may interrupt the process at any time. In a preferred
embodiment, the practitioner may set a limit to the number of SNPs permitted in the
generated sub-haplotypes. (See Fig. 39A, where "fixed site = 4" is a 4-SNP limit.)
In this embodiment the system would then determine if new combinations within
the limit are possible prior to each pairwise combination step.
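
One round of the pair-wise combination step can be sketched with wildcard patterns of the kind shown in Fig. 39A, where '*' marks a site that is not part of the sub-haplotype. The function names and the three-site example are assumptions; statistical testing of the new candidates is left out.

```python
from itertools import combinations

def merge(p, q):
    """Combine two wildcard patterns; return None if they disagree at any site."""
    merged = []
    for a, b in zip(p, q):
        if a == "*":
            merged.append(b)
        elif b == "*" or a == b:
            merged.append(a)
        else:
            return None
    return "".join(merged)

def n_sites(pattern):
    """Number of SNPs specified (non-wildcard positions) in a pattern."""
    return sum(c != "*" for c in pattern)

def next_generation(saved, tested, max_sites):
    """Novel pair-wise combinations of saved markers, within the SNP limit."""
    new = set()
    for p, q in combinations(sorted(saved), 2):
        m = merge(p, q)
        if m and m not in tested and n_sites(m) <= max_sites:
            new.add(m)
    return new

# Hypothetical saved single SNPs at three sites.
saved = {"A**", "*C*", "**G"}
round_1 = next_generation(saved, tested=set(saved), max_sites=3)
round_2 = next_generation(saved | round_1, tested=saved | round_1, max_sites=3)
```

In a full run, each round's candidates would be screened against the cut-offs before being added to the saved set, and iteration would stop when `next_generation` returns an empty set.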

In a preferred embodiment, complex redundant sub-haplotypes
are removed from the pair-wise generated sub-haplotypes (S14).
Complex redundant sub-haplotypes are those which are constructed from smaller
sub-haplotypes, where the smaller sub-haplotypes have correlation values that are at
least as significant as that of the complex sub-haplotype, i.e., they have correlation
values that account for the correlation value of the complex redundant
sub-haplotype. In such cases the complex haplotype provides no additional information
beyond what the component sub-haplotypes provide, which makes it redundant.
The non-redundant haplotypes and sub-haplotypes that remain are those that have
the strongest association with the clinical outcome values. These are saved for
future use (S16).

b. Reverse SNP Analysis Method 
(Pare-Down Process) 

This aspect of the invention provides a method for 
discovering which particular SNPs or sub-haplotypes correlate with a phenotype of 
interest, when one has in hand single gene haplotype correlation values. The 
process is outlined in the flow chart illustrated in Fig. 46. 

The first step (S17) is the collection of haplotype information
and clinical data from a cohort of subjects. Clinical data may be acquired before,
during, or after collection of the haplotype information. The clinical data may be
the diagnosis of a disease state, a response to an administered drug, a side-effect of
an administered drug, or other manifestation of a phenotype of interest for which the
practitioner desires to determine correlated haplotypes. The data is referred to as
"clinical outcome values." These values may be binary (e.g., response/no response,
survival at 5 months, toxicity/no toxicity, etc.) or may be continuous (e.g., liver
enzyme levels, serum concentrations, drug half-life, etc.).

The collection of haplotype information is the determination
(e.g., by direct sequencing or by statistical inference) of a pattern of SNPs for each
allele of each of a pre-selected group of genes, for each individual in the cohort.
The group of genes selected may be chosen based on any criteria the practitioner
desires to employ. For example, if the haplotype data is being collected in order to
build a general-purpose haplotype database, a large number of clinically and
pharmacologically relevant genes are likely to be selected. Where a retrospective
analysis of a cohort from an ongoing or completed clinical study is being carried
out, a smaller number of genes judged to be relevant might be selected.

The next step (S18) is the finding of single-gene haplotype
correlations. Each individual haplotype of each gene is statistically analyzed for the
degree to which it correlates with the phenotype or clinical outcome value of
interest. The analysis may be any of several types, such as a regression analysis
(correlating the number of occurrences of the haplotype in the subject's genome, i.e.,
0, 1, or 2, with the value of the clinical measurement), ANOVA analysis
(correlating a continuous clinical outcome value with the presence of the haplotype,
relative to the outcome value of individuals lacking the haplotype), or case-control
chi-square analysis (correlating a binary clinical outcome value with the presence of
the haplotype, relative to the outcome value of individuals lacking the haplotype).

In one embodiment, a "tight cut-off" criterion is next applied
to each haplotype in turn. A first haplotype is selected (S19) and its correlation with
the clinical outcome value is tested against a tight cut-off (S20). A typical value for
the tight cut-off will be in the range p = .01 to .05, although other values may be
chosen on empirical or theoretical grounds. If the haplotype correlation meets the
tight cut-off it is displayed to the user of the system (S21) (or, alternatively, stored
for later display), and stored for later combination (S22). If the haplotype
correlation does not meet the tight cut-off it is tested against a "loose cut-off" (S23),
typically in the range p = .05 to 0.1. Again, other cut-off values may be chosen if
desired for any reason. A haplotype meeting the loose cut-off is stored for later
combination (S22). Any haplotype whose correlation does not meet either cut-off is
discarded (S24), i.e., it is not considered further in the process. If there are
haplotypes remaining to be tested against the cut-offs (S25) they are selected (S26)
and tested (S20) in turn.

In an alternative embodiment, a tight cut-off is not applied.
The correlation of each haplotype is tested directly against the loose cut-off, and the
haplotype is either saved or discarded. In this embodiment, correlations of
sub-haplotypes generated by masking (see below) are also tested directly against the
loose cut-off. If desired, sub-haplotypes which are saved at the end of this
alternative process may be measured against a tight cut-off, and those that pass may
be displayed.

When all haplotypes have had their correlations tested, the
next step of the process consists of generating all possible sub-haplotypes in which a
single SNP is masked, i.e., its identity is disregarded. If novel (i.e., untested)
sub-haplotypes are possible (S27), which will be the case on the first iteration, they are
generated by systematically masking each SNP of all saved haplotypes (S28). The
correlations of the newly generated sub-haplotypes with the clinical outcome value
are calculated (S29), as was done for the haplotypes themselves. A first
sub-haplotype is selected (S30) and its correlation is tested against the tight and loose
cut-offs (S20, S23) as described above for the haplotype correlations. Each
sub-haplotype is tested in turn, as described above, discarding any sub-haplotypes that
do not pass the cut-off criteria and saving those that do pass.
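
The masking step can be sketched with the same wildcard notation, where '*' marks a masked SNP. The function names and the three-site haplotype are hypothetical, and the sketch assumes that a pattern with every site masked is not a useful marker and so is never generated.

```python
def mask_each_site(pattern):
    """All sub-haplotypes obtained by masking exactly one unmasked SNP."""
    if sum(c != "*" for c in pattern) <= 1:
        return set()  # masking the last SNP would leave nothing to test
    return {
        pattern[:i] + "*" + pattern[i + 1:]
        for i, c in enumerate(pattern)
        if c != "*"
    }

def pare_down_round(saved, tested):
    """Novel sub-haplotypes generated by masking one SNP of each saved marker."""
    new = set()
    for pattern in saved:
        new |= mask_each_site(pattern) - tested
    return new

# Hypothetical saved full haplotype over three sites.
round_1 = pare_down_round({"ACG"}, tested={"ACG"})
round_2 = pare_down_round(round_1, tested={"ACG"} | round_1)
```

Iterating until `pare_down_round` returns an empty set mirrors the stopping condition in the text, which is reached at the latest when the sub-haplotypes have been reduced to individual SNPs.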

Optionally, in a preferred embodiment, complex redundant
haplotypes and sub-haplotypes are discarded after correlations are calculated for the
sub-haplotypes and SNPs generated by the masking step (S31). Complex redundant
haplotypes and sub-haplotypes are those which are constructed from smaller
sub-haplotypes or SNPs, where the smaller sub-haplotypes or SNPs have correlation
values that are at least as significant as that of the complex sub-haplotype, i.e., they
have correlation values that account for the correlation value of the complex
redundant sub-haplotype. In such cases the complex haplotype or sub-haplotype
provides no additional information beyond what its component sub-haplotypes or
SNPs provide, which makes it redundant.

When all sub-haplotypes have been examined, the process
generates new sub-haplotypes by masking SNPs among the newly saved
sub-haplotypes. The process is preferably iterated until no new sub-haplotypes are
being generated; this may occur only when the sub-haplotypes have been reduced to
individual SNPs. Alternatively the practitioner may interrupt the process at any
time.

The non-redundant sub-haplotypes and SNPs that remain are 
those that have the strongest association with the clinical outcome values. These are 
saved for future use (S32).

E. TOOLS OF THE INVENTION

The methods of the invention preferably use a tool called the 
DecoGen™ Application. 

The tool consists of: 

a. One or more databases that contain (1) haplotypes for
a gene (or other loci) for many individuals (i.e., people for the CTS™ method
application, but it would include animals, plants, etc. for other applications) for one
or more genes and (2) a list of phenotypic measurements or outcomes that can
include, but are not limited to: disease measurements, drug response measurements,
plant yields, plant disease resistance, plant drought resistance, plant interaction with
pest-management strategies, etc. The databases could include information generated
either internally or externally (e.g., GenBank).

b. A set of computer programs that analyze and display 
the relationships between the haplotypes for an individual and its phenotypic 
characteristics (including drug responses). 

Specific aspects of the tool which are novel include: 
a. A method of displaying measurements (such as
quantitative phenotypic responses) for groups of individuals with the same group of
haplotypes or sub-haplotypes, and thereby easily showing how responses segregate
by haplotype or sub-haplotype composition. In the example herein, the display
shows a matrix where the rows are labeled by one haplotype and the columns by a 
second. Each cell of the matrix is labeled either by numbers, by colors representing 
numbers, by a graph representing a distribution of values for the group or by other 
graphical controls that allow for further data mining for that group. 

b. A minimal spanning tree display (see, e.g., Ref. 8) 
showing the phylogenetic distance between haplotypes. Each node, which 
represents a haplotype, is labeled by a graphic that shows statistics about the 
haplotype (for example, fraction of the population, contribution to disease 
susceptibility). 

c. Numerical modeling tools that produce a quantitative
model linking the haplotype structure with any specific phenotypic outcome, which
is preferably quantitative or categorical. Examples of outcomes include years of
survival after treatment with anticancer drugs and increase in lung capacity after 
taking an asthma medication. This model can use a genetic algorithm or other 
suitable optimization algorithm to find the most predictive models. This can be 
extended to multiple genes using the current method (see Equation 5). Techniques 
such as Factor Analysis (Ref. 4, Chapter 14) could be used to find the minimal set of 
predictive haplotypes. 

d. A genotype-to-haplotype method that allows the user
to find the smallest number of sites to genotype in order to infer an individual's
haplotypes or sub-haplotypes for a given gene. An individual's haplotypes provide
unambiguous knowledge of his genetic makeup and hence of the protein variations
that person possesses. As described earlier, the individual's genotype does not
distinguish his haplotypes so there is ambiguity about what protein variants the
individual will express. However, using current technology, it is much more
expensive to directly haplotype an individual than it is to genotype him. The
method described above allows one to predict an individual's haplotypes, and
therefore to make use of the predictive haplotype-to-response correlation derived
from a clinical trial. The steps required for this to work are (a) determine the
haplotype frequencies from the reference population directly; (b) correct the
observed frequencies to conform to Hardy-Weinberg equilibrium (unless it is
determined that the deviation is not due to sampling bias as discussed above); and
(c) use the statistical approach described in the third paragraph of item 6 above to
predict individuals' haplotypes or sub-haplotypes from their genotypes.

F. DATA/DATABASE MODEL 

The present invention uses a relational database which
provides a robust, scalable and reliable data storage and data management
mechanism. The computing hardware and software platforms, with 7x24 teams of
database administration and development support, provide the relational database
with advantageous guaranteed data quality, data security, and data availability. The
database models of the present invention provide tables and their relationships
optimized for efficiently storing and searching genomic and clinical information,
and otherwise utilizing a genomics-oriented database.

A data model (or database model) describes the data fields 
one wishes to store and the relationships between those data fields. The model is a 
blueprint for the actual way that data is stored, but is generic enough that it is not 
restricted to a particular database implementation (e.g., Sybase or Oracle). In the
preferred embodiment of the present invention, the model stores the data required 
by the DecoGen application. 

1. Database Model Version 1

a. Submodels

In one embodiment, the database comprises 5 submodels
which contain logically related subsets of the data. These are described below.

1. Gene Repository (Fig. 25A): This submodel describes the 
gene loci and its related domains. It captures the information on gene, gene 
structure, species, gene map, gene family, therapeutic applications of genes, gene 
naming conventions and publication literature including the patent information on 
these objects. 

2. Population Repository (Fig. 25B): This submodel 
encapsulates the patient and population information. It covers entities such as 
patient, ethnic and geographical background of patient and population, medical 
conditions of the patients, family and pedigree information of the patients, patient 
haplotype and polymorphism information and their clinical trial outcomes. 

3. Polymorphism Repository (Fig. 25C): This submodel 
stores the haplotypes and the polymorphisms associated with genes and patient 
cohorts used in clinical trials. The polymorphisms may include SNPs, small 
insertions/deletions, large insertions/deletions, repeats, frame shifts and alternative
splicing. 

4. Sequence Repository (Fig. 25D): Genetic sequence 
information in the form of genomic DNA, cDNA, mRNA and protein is captured by 
this data submodel. What is more important in this model is the location 
relationship between the gene structural features and the sequences. Patent
information on sequences is also covered. 

5. Assay Repository (Fig. 25E): This submodel captures client
companies, contact information, compounds used in the different disease areas and
assay results for such compounds in regards to polymorphisms and haplotypes in
target genes.

A model or sub-model is a collection of database tables. A
table is described by its columns, where there is one column for each data field. For
instance, the table COMPANY contains the following 3 columns: COMPANY_ID,
COMPANY_NAME, and DESCR. COMPANY_ID is a unique number (1, 2, 3,
etc.) assigned to the company. COMPANY_NAME holds the name (e.g.,
"Genaissance") and DESCR holds extra descriptive information about the company
(e.g., "The HAP Company"). There will be one row in this table for each company
for which data exists in the database. In this case COMPANY_ID is the "primary
key", which requires that no two companies have the same value of COMPANY_ID,
i.e., that it is unique in the table. Tables are connected together by "relationships".
To understand this, refer to Figure 25E, which shows the table
COMPANYADDRESS. It has fields COMPANY_ID, STREET, CITY, etc. In this
table the field COMPANY_ID refers back to the table COMPANY. If a company
has several locations, there will be several rows in the table COMPANYADDRESS,
each with the same value of COMPANY_ID. For each of these we can get the
name and description of the company by referring back to the COMPANY table.
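
The primary key and relationship just described can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than the Sybase or Oracle implementations mentioned above, shows only a subset of the columns, and uses invented street and city values; none of these choices is taken from the figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE COMPANY (
    COMPANY_ID   INTEGER PRIMARY KEY,   -- unique within the table
    COMPANY_NAME TEXT,
    DESCR        TEXT
);
CREATE TABLE COMPANYADDRESS (
    COMPANY_ID INTEGER NOT NULL REFERENCES COMPANY(COMPANY_ID),
    STREET     TEXT,
    CITY       TEXT
);
""")

conn.execute(
    "INSERT INTO COMPANY VALUES (1, 'Genaissance', 'The HAP Company')"
)
# Two locations for the same company: two rows sharing one COMPANY_ID.
conn.executemany(
    "INSERT INTO COMPANYADDRESS VALUES (?, ?, ?)",
    [(1, "1 Example Street", "City A"), (1, "2 Sample Avenue", "City B")],
)

# "Referring back" to COMPANY is a join on COMPANY_ID.
address_rows = conn.execute("""
    SELECT c.COMPANY_NAME, c.DESCR, a.CITY
    FROM COMPANYADDRESS a
    JOIN COMPANY c ON c.COMPANY_ID = a.COMPANY_ID
    ORDER BY a.CITY
""").fetchall()


def insert_duplicate_company():
    """Attempt to reuse COMPANY_ID 1; the primary key should reject it."""
    try:
        conn.execute("INSERT INTO COMPANY VALUES (1, 'Other', 'duplicate')")
    except sqlite3.IntegrityError:
        return False
    return True
```

Each address row yields the same company name and description, and a second company with COMPANY_ID 1 is rejected by the primary key constraint.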

b. Abbreviations

The following abbreviations are used in FIGURES 25A-E 
and the tables describing the database model depicted therein: 



AA        amino acid
Clin      clinical
Descr     description
FK        foreign key
Geo       geographical
Hap       haplotype
ID        identifier
Loc       location
Mol       molecule
NT        nucleotide
PK        primary key
Poly      polymorphism
Pos       position
Pub       publication
QC        quality control
Seq       sequence
SNP       single nucleotide polymorphism
Therap    therapeutic



c. Tables 

In this embodiment of the present invention, the database 
contains 76 tables as follows: 

1) Accession
2) Assay
3) AssayResult
4) BioSequence
5) ChromosomeMap
6) ClasperClone
7) ClinicalSite
8) Company
9) CompanyAddress
10) Compound
11) CompoundAssay
12) Contact
13) FamilyMember
14) FamilyMemberEthnicity
15) Feature
16) FeatureAccession
17) FeatureGeneLocation
18) FeatureInfo
19) FeatureKey
20) FeatureList
21) FeaturePub
22) Gene
23) GeneAccession
24) GeneAlias
25) GeneFamily
26) GeneMapLocation
27) GenePathway
28) GenePriority
29) GenePub
30) GenotypeCode
31) Ethnicity
32) HapAssay
33) HapCompoundAssay
34) HapHistory
35) Haplotype
36) HapMethod
37) HapPatent
38) HapPub
39) HapSNP
40) HapSNPHistory
41) LocationType
42) MapType
43) Method
44) MoleculeType
45) Nomenclature
46) Patent
47) PatentImage
48) Pathway
49) PathwayPub
50) PolyMethod
51) Polymorphism
52) PolyNameAlias
53) PolySeq3
54) PolySeq5
55) Publication
56) SeqAccession
57) SeqFeatureLocation
58) SeqGeneLocation
59) SeqSeqLocation
60) SequenceText
61) SNPAssay
62) SNPPatent
63) SNPPub
64) Species
65) Patient
66) PatientCohort
67) PatientEthnicity
68) PatientHap
69) PatientHapClinOutcome
70) PatientHapHistory
71) PatientMedicalHistory
72) PatientSNP
73) PatientSNPHistory
74) TherapeuticArea
75) TherapeuticGene
76) VariationType

Additional tables (not shown) may include Allele,
FeatureMapLocation, PubImage, and TherapCompound.

d. Fields 

Figures 25A-E show the fields of each table in the database.
The following are descriptions of the fields found in the database as well as for 
fields and tables that could be added to the database: 



Table: Accession

Name                Null?     Type            Comments
ACCESSION           NOT NULL  VARCHAR2(20)    a unique ID for a sequence in the commonly used public
                                              domain databases; becomes de facto standard for sequence
                                              data access in the academia and industry
SOURCE                        VARCHAR2(20)    who issued the ID
DESCR                         VARCHAR2(200)   other descriptions
INSERTED_BY                   VARCHAR2(30)    who inserted the record
INSERT_TIME                   DATE            when
UPDATED_BY                    VARCHAR2(30)    who updated the record
UPDATE_TIME                   DATE            when



Table: Allele

Name                Null?     Type            Comments
ALLELE_NAME         NOT NULL  NUMBER(4)       allele is the one member of a pair or series of genes that
                                              occupy a specific position on a specific chromosome
POLY_ID             NOT NULL  NUMBER          Foreign key to the polymorphism record
NT_SEQ_TEXT                   VARCHAR2(4000)  Nucleotide sequence string
AA_SEQ_TEXT                   VARCHAR2(1000)  Amino acid sequence
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
Table: Assay

Name                Null?     Type            Comments
ASSAY_ID            NOT NULL  NUMBER          Primary key for the assay table
ASSAY_NAME                    VARCHAR2(50)
ASSAY_PARAMETERS              VARCHAR2(200)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: AssayResult

Name                Null?     Type            Comments
ASSAY_ID            NOT NULL  NUMBER
ASSAY_TYPE                    VARCHAR2(100)
MEASURE                       VARCHAR2(200)   measurement of the assay parameters
TIMESTAMP                     DATE            time of operation
OPERATOR                      VARCHAR2(50)    who did it
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: BioSequence

Name                Null?     Type            Comments
SEQ_ID              NOT NULL  NUMBER          sequence ID (PK)
MOL_TYPE            NOT NULL  VARCHAR2(20)    molecular type
SEQ_LENGTH                    NUMBER          sequence length
PATENT_ID                     NUMBER          FK to the patent record
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: ChromosomeMap

Name                Null?     Type            Comments
MAP_ID              NOT NULL  NUMBER(4)       unique genetic map ID
MAP_TYPE_ID         NOT NULL  NUMBER(4)       FK to MapType
SPECIES_ID          NOT NULL  NUMBER          FK to species
CHROMOSOME                    VARCHAR2(2)
MAP_NAME                      VARCHAR2(50)
EXTERNAL_KEY                  VARCHAR2(50)    ID used by external sources
KEY_SOURCE                    VARCHAR2(20)    which source
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: ClasperClone

Name                Null?     Type            Comments
CLASPER_CLONE_ID    NOT NULL  NUMBER          Unique ID for each Clasper clone
PI                            VARCHAR2(50)    Subject ID; it is the FK to Subject table
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: ClinicalSite

Name                Null?     Type            Comments
CLINICAL_SITE_ID    NOT NULL  NUMBER(4)
SITE_NAME                     VARCHAR2(50)
COMPANY_ID                    NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: Company

Name                Null?     Type            Comments
COMPANY_ID          NOT NULL  NUMBER
COMPANY_NAME                  VARCHAR2(50)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: CompanyAddress

Name                Null?     Type            Comments
COMPANY_ID          NOT NULL  NUMBER
CONTACT_ID          NOT NULL  NUMBER
STREET                        VARCHAR2(50)
CITY                          VARCHAR2(50)
STATE                         VARCHAR2(50)
COUNTRY                       VARCHAR2(100)
ZIP                           VARCHAR2(20)
WEB_SITE                      VARCHAR2(200)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: Compound

Name                Null?     Type            Comments
COMPOUND_ID         NOT NULL  NUMBER
COMPANY_ID                    NUMBER
THERAP_ID                     NUMBER
PATENT_ID                     NUMBER
REGISTRATION_NUM              VARCHAR2(50)    Compound registration number is generally the unique ID
                                              for the compound in that company
COMPOUND_NAME                 VARCHAR2(200)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: CompoundAssay

Name                Null?     Type            Comments
COMPOUND_ID         NOT NULL  NUMBER
ASSAY_ID            NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: Contact

Name                Null?     Type            Comments
CONTACT_ID          NOT NULL  NUMBER
COMPANY_ID          NOT NULL  NUMBER
ADDRESS_ID                    NUMBER
LAST_NAME                     VARCHAR2(50)
MIDDLE_NAME                   VARCHAR2(20)
FIRST_NAME                    VARCHAR2(50)
OFFICE_PHONE                  VARCHAR2(20)
EMAIL                         VARCHAR2(100)
CELL_PHONE                    VARCHAR2(20)
PAGER_PHONE                   VARCHAR2(20)
FAX                           VARCHAR2(20)
WEB_SITE                      VARCHAR2(200)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
Table: FamilyMember

Name                Null?     Type            Comments
PI                  NOT NULL  VARCHAR2(50)    FK to Patient
FAMILY_POSITION     NOT NULL  VARCHAR2(20)    examples are siblings, parents, grandparents, etc.
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: FamilyMemberEthnicity

Name                Null?     Type            Comments
PI                  NOT NULL  VARCHAR2(50)
FAMILY_POSITION     NOT NULL  VARCHAR2(20)
ETHNIC_CODE         NOT NULL  VARCHAR2(20)    FK pointing to the Ethnicity table
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: Feature

Name                Null?     Type            Comments
FEATURE_ID          NOT NULL  NUMBER          a feature is defined as either a genomic structure of a
                                              gene, or a fragment of DNA on a chromosome in the genome.
GENE_ID                       NUMBER          FK pointing to the Gene table in case of feature of a gene
FEATURE_NAME                  VARCHAR2(50)
FEATURE_KEY_ID      NOT NULL  NUMBER(3)       FK pointing to the FeatureKey table to allow only
                                              validated feature types
MAP_ID                        NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: FeatureAccession

Name                Null?     Type            Comments
ACCESSION           NOT NULL  VARCHAR2(20)
FEATURE_ID          NOT NULL  NUMBER
START_POS                     NUMBER          the start position of the feature in the sequence
                                              identified by that accession
END_POS                       NUMBER          the end position
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: FeatureGeneLocation

Name                Null?     Type            Comments
GENE_ID             NOT NULL  NUMBER          FK
LOC_TYPE            NOT NULL  VARCHAR2(20)    location type determines what type of structural
                                              relationship we are going to build in the particular
                                              case between the gene and the feature
FEATURE_ID          NOT NULL  NUMBER          FK
LOC_VALUE                     NUMBER          if the location type requires only one value, here it goes
RANGE_FROM                    NUMBER          if the location type is a range, then this is the start
                                              position
RANGE_TO                      NUMBER          and this is the end position
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
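
The comments on LOC_VALUE, RANGE_FROM and RANGE_TO imply that a location row carries either a single value or a range. A minimal sketch of how an application might interpret such a row is given below; the function name `resolve_location` and the choice to return a single value as a one-position interval are assumptions made for illustration.

```python
from typing import Optional, Tuple


def resolve_location(
    loc_value: Optional[int],
    range_from: Optional[int],
    range_to: Optional[int],
) -> Tuple[int, int]:
    """Return a (start, end) interval for one location row.

    A single-valued location uses LOC_VALUE; a ranged location uses
    RANGE_FROM and RANGE_TO. Which one applies would normally follow
    from LOC_TYPE; here it is inferred from which fields are populated.
    """
    if loc_value is not None:
        return (loc_value, loc_value)
    if range_from is not None and range_to is not None:
        if range_from > range_to:
            raise ValueError("RANGE_FROM must not exceed RANGE_TO")
        return (range_from, range_to)
    raise ValueError("row has neither LOC_VALUE nor a complete range")
```

For example, `resolve_location(120, None, None)` gives the interval (120, 120), while `resolve_location(None, 5, 40)` gives (5, 40).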



Table: FeatureInfo

Name                Null?     Type            Comments
FEATURE_ID          NOT NULL  NUMBER
QUALIFIER           NOT NULL  VARCHAR2(50)    a free set of annotations to a feature
DETAIL_VALUE                  VARCHAR2(2000)  the values of the qualifier annotation
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: FeatureKey

Name                Null?     Type            Comments
FEATURE_KEY_ID      NOT NULL  NUMBER(3)
FEATURE_KEY                   VARCHAR2(20)    feature key validates the feature types allowed
SOURCE                        VARCHAR2(20)    who defined the key
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: FeatureList

Name                Null?     Type            Comments
FEATURE_ID          NOT NULL  NUMBER          PK1
ITEM_ID             NOT NULL  NUMBER          PK2. This structure is used to build the relationship
                                              between 2 features
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: FeatureMapLocation

Name                Null?     Type            Comments
FEATURE_ID          NOT NULL  NUMBER
MAP_ID              NOT NULL  NUMBER(4)
MAP_LOCATION                  NUMBER          gene or genome map location of the feature
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: FeaturePub

Name                Null?     Type            Comments
PUB_ID              NOT NULL  NUMBER          publication ID is the PK & FK
FEATURE_ID          NOT NULL  NUMBER          so is the feature ID. This table builds the many-to-many
                                              relationship between the tables of Publication and Feature
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
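
The many-to-many relationship built by FeaturePub can be illustrated with a minimal sketch, again using Python's sqlite3 module in place of the production database. Only the key columns are shown, the composite primary key reflects the comment that both IDs are "PK & FK", and the publication titles and feature names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PUBLICATION (PUB_ID INTEGER PRIMARY KEY, TITLE TEXT);
CREATE TABLE FEATURE (FEATURE_ID INTEGER PRIMARY KEY, FEATURE_NAME TEXT);
-- Association table: each row links one publication to one feature.
CREATE TABLE FEATUREPUB (
    PUB_ID     INTEGER NOT NULL REFERENCES PUBLICATION(PUB_ID),
    FEATURE_ID INTEGER NOT NULL REFERENCES FEATURE(FEATURE_ID),
    PRIMARY KEY (PUB_ID, FEATURE_ID)
);
INSERT INTO PUBLICATION VALUES (1, 'Paper one'), (2, 'Paper two');
INSERT INTO FEATURE VALUES (10, 'exon 1'), (11, 'promoter');
INSERT INTO FEATUREPUB VALUES (1, 10), (1, 11), (2, 10);
""")


def features_for_publication(pub_id):
    """Names of all features discussed in one publication."""
    rows = conn.execute(
        """SELECT f.FEATURE_NAME
           FROM FEATUREPUB fp JOIN FEATURE f ON f.FEATURE_ID = fp.FEATURE_ID
           WHERE fp.PUB_ID = ? ORDER BY f.FEATURE_NAME""",
        (pub_id,),
    ).fetchall()
    return [name for (name,) in rows]


def publications_for_feature(feature_id):
    """Titles of all publications concerning one feature."""
    rows = conn.execute(
        """SELECT p.TITLE
           FROM FEATUREPUB fp JOIN PUBLICATION p ON p.PUB_ID = fp.PUB_ID
           WHERE fp.FEATURE_ID = ? ORDER BY p.TITLE""",
        (feature_id,),
    ).fetchall()
    return [title for (title,) in rows]
```

The same pattern applies to the other association tables in the model, such as GenePub, HapPub and SNPPub.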



Table: Gene

Name                Null?     Type            Comments
GENE_ID             NOT NULL  NUMBER          unique ID for a gene
GENE_SYMBOL         NOT NULL  VARCHAR2(20)    standardized gene symbols used in the most simplistic
                                              manner to refer to a gene
GENE_FAMILY_ID                NUMBER          the family cluster a gene belongs to
SPECIES_ID          NOT NULL  NUMBER          the species which has this gene
PATENT_ID                     NUMBER          the patent associated with this gene
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
Table: GeneAccession

Name                Null?     Type            Comments
GENE_ID             NOT NULL  NUMBER          gene and the sequence association through the unique
                                              accession
ACCESSION           NOT NULL  VARCHAR2(20)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: GeneAlias

Name                Null?     Type            Comments
GENE_ID             NOT NULL  NUMBER          table to handle the various alias names for a gene
ALIAS_NAME          NOT NULL  VARCHAR2(500)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: GeneFamily

Name                Null?     Type            Comments
GENE_FAMILY_ID      NOT NULL  NUMBER(4)
FAMILY_NAME                   VARCHAR2(60)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
Table: GeneMapLocation

Name                Null?     Type            Comments
GENE_ID             NOT NULL  NUMBER
MAP_ID              NOT NULL  NUMBER(4)
MAP_LOCATION                  NUMBER          genome map location
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: GenePathway

Name                Null?     Type            Comments
PATHWAY_ID          NOT NULL  NUMBER(4)       the biological pathway in which the gene plays a role
GENE_ID             NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: GenePriority

Name                Null?     Type            Comments
GENE_ID             NOT NULL  NUMBER          internal info for gene project prioritization
TASK_FORCE_NUM                NUMBER(6)
REX_PRIORITY                  VARCHAR2(5)
NEW_PRIORITY                  VARCHAR2(5)
REALM_PRIORITY                VARCHAR2(5)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: GenePub

Name                Null?     Type            Comments
PUB_ID              NOT NULL  NUMBER          publications concerning a gene
GENE_ID             NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
Table: GenotypeCode

Name                Null?     Type            Comments
GENOTYPE            NOT NULL  CHAR(1)         genotyping code for the polymorphism
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: Ethnicity

Name                Null?     Type            Comments
ETHNIC_GROUP                  VARCHAR2(20)    the major ethnic groups such as Caucasian, Asian, etc.
ETHNIC_CODE         NOT NULL  VARCHAR2(20)    the Ethnic code that specifies the detailed geographical
                                              and ethnic background of the subject (patient, or genetic
                                              sample donor)
ETHNIC_NAME                   VARCHAR2(100)   the name description of the code
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: HapAssay

Name                Null?     Type            Comments
HAP_ID              NOT NULL  NUMBER          unique ID for the haplotype
ASSAY_ID            NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: HapCompoundAssay

Name                Null?     Type            Comments
HAP_ID              NOT NULL  NUMBER          association table where the haplotype of a gene and a
                                              compound meet in a specific assay
COMPOUND_ID         NOT NULL  NUMBER
ASSAY_ID            NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: HapHistory

Name                Null?     Type            Comments
HAP_HISTORY_ID      NOT NULL  NUMBER          history table to keep track of the knowledge progress
                                              concerning a haplotype
HAP_ID                        NUMBER
GENE_ID                       NUMBER
CREATE_TIMESTAMP              DATE            when created
HAP_NAME                      VARCHAR2(50)
HISTORY_TIMESTAMP             DATE            when put into history
ORIGINAL_DESCR                VARCHAR2(200)
HISTORY_DESCR                 VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
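
The specification does not state how rows reach the HapHistory table. One plausible arrangement, sketched below purely as an assumption, is to copy the current Haplotype row into HapHistory in the same transaction that revises it. The function name `revise_haplotype`, the use of text dates, and the example values are hypothetical, and sqlite3 stands in for the production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HAPLOTYPE (
    HAP_ID INTEGER PRIMARY KEY, GENE_ID INTEGER,
    TIMESTAMP TEXT, HAP_NAME TEXT, DESCR TEXT
);
CREATE TABLE HAPHISTORY (
    HAP_HISTORY_ID INTEGER PRIMARY KEY, HAP_ID INTEGER, GENE_ID INTEGER,
    CREATE_TIMESTAMP TEXT, HAP_NAME TEXT, HISTORY_TIMESTAMP TEXT,
    ORIGINAL_DESCR TEXT, HISTORY_DESCR TEXT
);
""")


def revise_haplotype(conn, hap_id, new_descr, reason, now):
    """Archive the current haplotype row, then update it.

    Both statements run in one transaction so that the history row and
    the revised row cannot get out of step.
    """
    with conn:
        conn.execute(
            """INSERT INTO HAPHISTORY
                   (HAP_ID, GENE_ID, CREATE_TIMESTAMP, HAP_NAME,
                    HISTORY_TIMESTAMP, ORIGINAL_DESCR, HISTORY_DESCR)
               SELECT HAP_ID, GENE_ID, TIMESTAMP, HAP_NAME, ?, DESCR, ?
               FROM HAPLOTYPE WHERE HAP_ID = ?""",
            (now, reason, hap_id),
        )
        conn.execute(
            "UPDATE HAPLOTYPE SET DESCR = ?, TIMESTAMP = ? WHERE HAP_ID = ?",
            (new_descr, now, hap_id),
        )


conn.execute(
    "INSERT INTO HAPLOTYPE VALUES (1, 7, '1999-06-25', 'HAP-1', 'initial call')"
)
revise_haplotype(
    conn, 1, "revised after resequencing", "superseded by new data", "2000-06-26"
)
```

After the call, HapHistory holds the original description together with the time it was created and the time it was put into history, while Haplotype holds the revised description.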



Table: Haplotype

Name                Null?     Type            Comments
HAP_ID              NOT NULL  NUMBER
GENE_ID                       NUMBER
TIMESTAMP                     DATE
HAP_NAME                      VARCHAR2(50)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: HapMethod

Name                Null?     Type            Comments
HAP_ID              NOT NULL  NUMBER          method used in haplotyping
METHOD_ID           NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: HapPatent

Name                Null?     Type            Comments
HAP_ID              NOT NULL  NUMBER          patent relates to a haplotype
PATENT_ID           NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: HapPub

Name                Null?     Type            Comments
PUB_ID              NOT NULL  NUMBER          publication relates to a haplotype
HAP_ID              NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: HapSNP

Name                Null?     Type            Comments
HAP_ID              NOT NULL  NUMBER          haplotype consists of SNPs
POLY_ID             NOT NULL  NUMBER
TIMESTAMP                     DATE
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: HapSNPHistory

Name                Null?     Type            Comments
HAP_SNP_HISTORY_ID  NOT NULL  NUMBER(4)       history about the progress of the SNPs that are used in
                                              a haplotype construction
HAP_ID              NOT NULL  NUMBER
POLY_ID             NOT NULL  NUMBER
CREATE_TIMESTAMP              DATE
HISTORY_TIMESTAMP             DATE
ORIGINAL_DESCR                VARCHAR2(200)
HISTORY_DESCR                 VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: LocationType

Name                Null?     Type            Comments
LOC_TYPE            NOT NULL  VARCHAR2(20)    location type for the various genetic objects in the
                                              genome
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: MapType

Name                Null?     Type            Comments
MAP_TYPE_ID         NOT NULL  NUMBER(4)       validation tool for the possible types of genome maps
MAP_TYPE                      VARCHAR2(20)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: Method

Name                Null?     Type            Comments
METHOD_ID           NOT NULL  NUMBER
METHOD              NOT NULL  VARCHAR2(60)    the lab experimental method
PROTOCOL                      VARCHAR2(2000)  the detailed protocol for a method
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: MoleculeType

Name                Null?     Type            Comments
MOL_TYPE            NOT NULL  VARCHAR2(20)    molecular type for which a sequence is known
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: Nomenclature

Name                Null?     Type            Comments
GENE_SYMBOL         NOT NULL  VARCHAR2(20)    used to standardize the naming of a gene. HUGO official
                                              name takes precedence in the naming scheme
GENE_NAME                     VARCHAR2(500)
SOURCE                        VARCHAR2(20)
CYTO_LOCATION                 VARCHAR2(60)    cytogenetic location of a gene; this is the best way to
                                              map various gene names onto a single gene
GDB_ID                        VARCHAR2(50)    ID by other public data source
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: Patent

Name                Null?     Type            Comments
PATENT_ID           NOT NULL  NUMBER
PATENT_TYPE                   VARCHAR2(20)    patent type can be issued, pending, etc.
COMPANY_ID                    NUMBER
INVENTORS                     VARCHAR2(200)
ABSTRACT                      VARCHAR2(1000)
INSTITUTION                   VARCHAR2(200)
CLAIMS                        VARCHAR2(4000)  the claims of the patent
TITLE                         VARCHAR2(200)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: PatentImage

Name                Null?     Type            Comments
PATENT_ID           NOT NULL  NUMBER
PDFFILE                       BLOB            the multi-media image file of the patent
DESCR                         VARCHAR2(20)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: Pathway

Name                Null?     Type            Comments
PATHWAY_ID          NOT NULL  NUMBER(4)       biological pathways
PATHWAY_NAME                  VARCHAR2(50)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: PathwayPub

Name                Null?     Type            Comments
PATHWAY_ID          NOT NULL  NUMBER(4)       publications concerning a pathway
PUB_ID              NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: PolyMethod

Name                Null?     Type            Comments
POLY_ID             NOT NULL  NUMBER          method used in discovering a polymorphism
METHOD_ID           NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: Polymorphism

Name                Null?     Type            Comments
POLY_ID             NOT NULL  NUMBER          PK for a polymorphism
FEATURE_ID          NOT NULL  NUMBER          where the polymorphism occurs in a genetic feature
VARIATION_TYPE      NOT NULL  VARCHAR2(3)     what type of polymorphism
POLY_CONSEQUENCE              VARCHAR2(200)   the consequence or mechanism of the polymorphism
SYSTEM_NAME                   VARCHAR2(50)    the systematic name for the polymorphism
START_POS                     NUMBER          starting position of the polymorphism in the feature
END_POS                       NUMBER          ending position
LENGTH                        NUMBER          length of the changing structure
PRIMER_ID                     VARCHAR2(50)    FK to a table in another in-house database where the
                                              primers used in the polymorphism discovery were kept
SAMPLE_SIZE                   NUMBER          the number of subjects used in the discovery of the
                                              polymorphism
QC                            VARCHAR2(20)    quality control information
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: PolyNameAlias

Name                Null?     Type            Comments
POLY_ID             NOT NULL  NUMBER
NAME_ALIAS                    VARCHAR2(50)    other names for the polymorphism
EXTERNAL_KEY                  VARCHAR2(50)    unique ID by other data sources
KEY_SOURCE                    VARCHAR2(20)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: PolySeq3

Name                Null?     Type            Comments
POLY_ID             NOT NULL  NUMBER          the 3' DNA sequence that flanks the polymorphic site
SEQ_TEXT            NOT NULL  VARCHAR2(250)   sequence string of this piece of DNA
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: PolySeq5

Name                Null?     Type            Comments
POLY_ID             NOT NULL  NUMBER          the 5' DNA sequence that flanks the polymorphic site
SEQ_TEXT            NOT NULL  VARCHAR2(250)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
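
As an illustration of how the 5' and 3' flanking sequences stored in PolySeq5 and PolySeq3 might be used, the sketch below assembles the local sequence context of a polymorphic site. The bracketed allele notation, the function names and the example sequences are assumptions made for the example and are not taken from the figures.

```python
def polymorphism_context(seq5, alleles, seq3):
    """Join the 5' flank, the alternative alleles and the 3' flank.

    seq5, seq3 -- flanking sequences (PolySeq5.SEQ_TEXT, PolySeq3.SEQ_TEXT)
    alleles    -- the alternative forms observed at the polymorphic site
    """
    return f"{seq5}[{'/'.join(alleles)}]{seq3}"


def allele_sequences(seq5, alleles, seq3):
    """Return the full local sequence for each allele in turn."""
    return [seq5 + allele + seq3 for allele in alleles]


# Hypothetical SNP with alleles A and G.
context = polymorphism_context("ACGT", ["A", "G"], "TTCA")
expanded = allele_sequences("ACGT", ["A", "G"], "TTCA")
```

Here `context` is "ACGT[A/G]TTCA" and `expanded` contains one nine-base sequence per allele.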



Table: PubImage

Name                Null?     Type            Comments
PUB_ID              NOT NULL  NUMBER
PDFFILE                       BLOB            image file of the publication
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: Publication

Name                Null?     Type            Comments
PUB_ID              NOT NULL  NUMBER          PK for a publication
AUTHORS                       VARCHAR2(200)
TITLE                         VARCHAR2(600)
INSTITUTION                   VARCHAR2(200)
SOURCE                        VARCHAR2(200)
KEYWORDS                      VARCHAR2(500)
ABSTRACT                      VARCHAR2(4000)
EXTERNAL_KEY                  VARCHAR2(50)
KEY_SOURCE                    VARCHAR2(20)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: SeqAccession

Name                Null?     Type            Comments
SEQ_ID              NOT NULL  NUMBER          PK for sequence
ACCESSION           NOT NULL  VARCHAR2(20)    unique ID from the public sequence databases
VERSION                       NUMBER          version of the sequence
GI                            NUMBER          gene ID issued by NCBI national database
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: SeqFeatureLocation

Name                Null?     Type            Comments
LOC_TYPE            NOT NULL  VARCHAR2(20)    sequence and feature location relationship
SEQ_ID              NOT NULL  NUMBER
FEATURE_ID          NOT NULL  NUMBER
LOC_VALUE                     NUMBER
RANGE_FROM                    NUMBER
RANGE_TO                      NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: SeqGeneLocation

Name                Null?     Type            Comments
GENE_ID             NOT NULL  NUMBER          sequence and gene location relationship
LOC_TYPE            NOT NULL  VARCHAR2(20)
SEQ_ID              NOT NULL  NUMBER
LOC_VALUE                     NUMBER
RANGE_FROM                    NUMBER
RANGE_TO                      NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: SeqSeqLocation

Name                Null?     Type            Comments
LOC_TYPE            NOT NULL  VARCHAR2(20)    sequence and sequence location relationship
SEQ_ID              NOT NULL  NUMBER
ITEM_ID             NOT NULL  NUMBER
LOC_VALUE                     NUMBER
RANGE_FROM                    NUMBER
RANGE_TO                      NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: SequenceText

Name                Null?     Type            Comments
SEQ_ID              NOT NULL  NUMBER          the actual sequence text in a string of characters
SMALL_SEQ_TEXT                VARCHAR2(4000)  if the sequence is less than 4000 characters, it is
                                              stored in this field
LARGE_SEQ_TEXT                LONG            if larger than 4K, stored as a LONG datatype in this
                                              field, which is much more limited in terms of processing
                                              by the DBMS. This division is caused by the fact that an
                                              Oracle VARCHAR2 data type can store only 4000 characters.
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: SNPAssay

Name                Null?     Type            Comments
POLY_ID             NOT NULL  NUMBER          polymorphism in an assay
ASSAY_ID            NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE
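
The division between SMALL_SEQ_TEXT and LARGE_SEQ_TEXT described for the SequenceText table can be sketched as a simple routing rule. The function name `sequence_text_columns` is hypothetical, and placing a sequence of exactly 4000 characters in the small field is an assumption, since the comments only address sequences shorter or longer than that limit.

```python
VARCHAR2_LIMIT = 4000  # capacity of SMALL_SEQ_TEXT per the field description


def sequence_text_columns(seq_text):
    """Decide which SequenceText column should hold a sequence.

    Returns a dict with exactly one of SMALL_SEQ_TEXT / LARGE_SEQ_TEXT
    populated and the other set to None.
    """
    if len(seq_text) <= VARCHAR2_LIMIT:
        return {"SMALL_SEQ_TEXT": seq_text, "LARGE_SEQ_TEXT": None}
    return {"SMALL_SEQ_TEXT": None, "LARGE_SEQ_TEXT": seq_text}


def read_sequence_text(row):
    """Recover the sequence regardless of which column holds it."""
    if row["SMALL_SEQ_TEXT"] is not None:
        return row["SMALL_SEQ_TEXT"]
    return row["LARGE_SEQ_TEXT"]
```

A reader of the table then needs only `read_sequence_text` and does not have to know which of the two fields was used when the sequence was stored.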



Table: SNPPatent

Name                Null?     Type            Comments
POLY_ID             NOT NULL  NUMBER          polymorphism related patent
PATENT_ID           NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: SNPPub

Name                Null?     Type            Comments
PUB_ID              NOT NULL  NUMBER          polymorphism related publications
POLY_ID             NOT NULL  NUMBER
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



Table: Species

Name                Null?     Type            Comments
SPECIES_ID          NOT NULL  NUMBER          a biological species
SYSTEM_NAME                   VARCHAR2(50)    its scientific systematic name
COMMON_NAME                   VARCHAR2(20)    its common name
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE

Table: Patient

Name                Null?     Type            Comments
CLINICAL_SITE_ID    NOT NULL  NUMBER(4)
PI                  NOT NULL  VARCHAR2(50)    patient ID as the unique identifier for a person
GENDER                        CHAR(1)
YOB                           DATE            year of birth
FAMILY_ID                     VARCHAR2(20)    family ID if known
FAMILY_POSITION               VARCHAR2(20)    the generation information in a family based genetic
                                              study
EXTERNAL_KEY                  VARCHAR2(20)    the ID used by other sources
KEY_SOURCE                    VARCHAR2(20)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



35 



table PatientCohort: the patient set used in a particular project

Name            Null?       Type
PROJECT_ID      NOT NULL    NUMBER
PI              NOT NULL    VARCHAR2(60)
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



table PatientEthnicity: ethnic background of a person

Name            Null?       Type
PI              NOT NULL    VARCHAR2(60)
ETHNIC_CODE     NOT NULL    VARCHAR2(20)
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



table PatientHap: haplotyping information of a person

Name            Null?       Type
PI              NOT NULL    VARCHAR2(50)
HAP_ID          NOT NULL    NUMBER
QC                          VARCHAR2(20)
TIMESTAMP                   DATE
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE

table PatientHapClinOutcome: the clinical measurement against a particular haplotype in a person

Name              Null?       Type
SI                NOT NULL    VARCHAR2(50)
HAP_ID            NOT NULL    NUMBER
CLIN_TEST_NAME                VARCHAR2(50)
CLIN_TEST_RESULT              VARCHAR2(20)
DESCR                         VARCHAR2(200)
INSERTED_BY                   VARCHAR2(30)
INSERT_TIME                   DATE
UPDATED_BY                    VARCHAR2(30)
UPDATE_TIME                   DATE



table SubjectHapHistory: history record of the haplotype information for a subject

Name                Null?       Type
S_HAP_HISTORY_ID    NOT NULL    NUMBER
HAP_ID                          NUMBER
QC                              VARCHAR2(20)
SI                              VARCHAR2(50)
CREATE_TIMESTAMP                DATE



SUBSTITUTE SHEET (RULE 26) 



WO 01/01218 PCT/US00/17540



-97- 



HISTORY_TIMESTAMP               DATE
ORIGINAL_DESCR                  VARCHAR2(200)
HISTORY_DESCR                   VARCHAR2(200)
INSERTED_BY                     VARCHAR2(30)
INSERT_TIME                     DATE
UPDATED_BY                      VARCHAR2(30)
UPDATE_TIME                     DATE



table SubjectMedicalHistory: medical conditions of a subject when the genetic sample is collected

Name            Null?       Type            Comments
SI              NOT NULL    VARCHAR2(50)
THERAP_ID       NOT NULL    NUMBER          FK pointing to a therapeutic area which maps to a disease
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



table SubjectSNP: the genotyping information of a person at a given polymorphic site

Name            Null?       Type            Comments
SI              NOT NULL    VARCHAR2(50)
POLY_ID         NOT NULL    NUMBER
GENOTYPE        NOT NULL    CHAR(1)
HAP_ID                      NUMBER          the polymorphism may be a part of a haplotype
QC                          VARCHAR2(20)
TIMESTAMP                   DATE
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE


table SubjectSNPHistory: history record for a polymorphism in a person

Name                Null?       Type
S_SNP_HISTORY_ID    NOT NULL    NUMBER
SI                              VARCHAR2(50)
POLY_ID                         NUMBER
HAP_ID                          NUMBER
GENOTYPE                        CHAR(1)
CREATE_TIMESTAMP                DATE
QC                              VARCHAR2(20)
HISTORY_TIMESTAMP               DATE
ORIGINAL_DESCR                  VARCHAR2(200)
HISTORY_DESCR                   VARCHAR2(200)
INSERTED_BY                     VARCHAR2(30)
INSERT_TIME                     DATE
UPDATED_BY                      VARCHAR2(30)
UPDATE_TIME                     DATE



table TherapCompound: a compound used in the treatment of a disease

Name            Null?       Type
COMPOUND_ID     NOT NULL    NUMBER
THERAP_ID       NOT NULL    NUMBER
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE

table TherapeuticArea

Name            Null?       Type            Comments
THERAP_AREA                 VARCHAR2(50)    the disease name
THERAP_ID       NOT NULL    NUMBER
RELATED_AREA                NUMBER(4)       its relation to other
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



table TherapeuticGene: the target gene for a

Name            Null?       Type
GENE_ID         NOT NULL    NUMBER
THERAP_ID       NOT NULL    NUMBER
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



table VariationType: the validated types of polymorphism

Name            Null?       Type
VARIATION_TYPE  NOT NULL    VARCHAR2(3)
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE
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o 

With reference to Figures 25A-E, and as is apparent to one of
skill in the art, rectangular boxes represent parent tables in the database, while
rounded boxes represent child tables that depend on their parent tables. This
dependency requires that a parent record exist before a child record can be
created. Within the tables, the primary keys are shown at the top and are partitioned
off from the other fields by a line. Repeat instances of primary keys are indicated
by "(FK)", meaning foreign key.

FIG. 25F describes the relational symbols used in FIGS. 25A-E.
A relational symbol such as indicated by reference numeral 2 represents an
identifying parent/child relationship. It depicts the not nullable 1-to-0-or-many
relationship. Not nullable means that one cannot create a record in the child unless
a corresponding record (indicated by the particular relating field) exists or is created
in the parent. A relational symbol such as indicated by reference numeral 4
represents a non-identifying parent/child relationship. It represents the nullable
0-or-1-to-many relationship. A relational symbol such as indicated by reference
numeral 6 represents an identifying parent/child relationship. It depicts the not
nullable 1-to-1-or-many relationship. A relational symbol such as indicated by
reference numeral 8 represents a non-identifying parent/child relationship. It
represents the not nullable 1-to-1-or-many relationship. A relational symbol such as
indicated by reference numeral 10 represents an identifying parent/child
relationship. It depicts the not nullable 1-to-exact-1 relationship. A relational
symbol such as indicated by reference numeral 12 represents a non-identifying
parent/child relationship. It represents the nullable 0-or-1-to-exact-1 relationship.
A relational symbol such as indicated by reference numeral 14 represents a
non-identifying parent/child relationship. It depicts the not nullable
0-or-1-to-many relationship.

2. Database Model Version 2 

A preferred embodiment of the database model of the
invention contains 5 sub-models and 83 tables. This model is organized at three
levels of detail: sub-model, table and fields of tables.
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a. Submodels 

The five submodels of this preferred embodiment are 
depicted in FIGURES 44A-E and are described below. 

1. Genomic Repository (FIGURE 44A): This submodel organizes
genomic information by spatial relationships. The central element of the genomic
repository submodel is the Genetic_Feature object, which is an abstract template for
any object having a nucleotide sequence that can be mapped to the nucleotide
sequence of other objects by providing a start and stop position. Genetic objects
(also referred to herein as genetic features) that are organized by the genomic
repository submodel include, but are not limited to, chromosomes, genomic regions,
genes, gene regions, gene transcripts and polymorphisms.

Some of these genetic objects contain nucleotide sequences
identified in the public domain while others represent some derived final state of a
calculation as described below for generating an assembly and gene structure. In
object parlance, Genetic_Feature is the base class from which these other objects are
extended. In relational terms, the primary keys for each of these genetic
objects are foreign keys to the primary key of the Genetic_Feature table. Each
genetic feature is represented by a unique Feature_ID that is generated by the
database management system's sequence generator. The principal properties of a
genetic feature are start position, stop position and reference. The start and stop
positions indicate the extent of that genetic feature relative to another given genetic
feature, which is the reference and is represented by another unique Feature_ID
generated by the database management system's sequence generator. The reference
serves as the parent in this table by the self-pointing foreign key of Ref_ID. The
Feature_Type attribute gives the database model the possibility to determine what
type of spatial relationship is legal among what types of genetic features at a given
time in a given context. For example, the system will allow a gene to map on to a
sequence assembly by defining the start and end position of the gene in the
assembly. A gene region is mapped on to a gene through a similar mechanism. The
mapping of the gene region onto the assembly will therefore be made possible
through the traversal of links between the Seq_Assembly and Gene tables and
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between the Gene and Gene Region tables. Similarly, a polymorphism is mapped 
on to a sequence that will be a building block for the assembly, which in turn 
determines the reference sequence for the gene being analyzed for genetic variation. 

This centralized organization of the positional relationships of
various genetic features through one parent table is believed to be novel and offers
significant advantages over known database designs by reducing the cost of
maintaining the database and increasing the efficiency of querying the database. In
addition, organization of genetic features by this novel relative positional
referencing approach allows this information to readily be organized into genomic
sequences, gene and gene transcript structures and also into diagrams mapping
genetic features to the assembled genomic and gene sequences. The design and use
of the genomic repository submodel are described in more detail below.

The most important genetic features are defined below, with
the names of the tables containing information specific to each genetic feature
indicated in parentheses if different.

Genome: The ultimate root feature for all genetic features.
Its reference link is always null, i.e. it is itself not mapped to anything. As long as
there is not a complete genomic sequence, there is little reason to actually have a
table for this.

Chromosome: The highest unit of contiguous genomic
sequence. The reference for chromosomes would be the genome. Because there is
no overlap between chromosomes, the genome is a disjoint assembly of all the
chromosomes, in a particular order, with gaps between all neighboring
chromosomes.

Assembly (Seq_Assembly): An assembly is defined as a set
of one or more contigs, ordered in a certain way. In the absence of genome or
chromosome features, the assembly will be the root of the genomic sequence
mapping tree. Its reference is then null.

Contig: A contiguous assembly of overlapping sequences
that are ordered 5' to 3'. A contig is preferably referenced to its assembly.

Unordered Contig: A collection of contiguous sequences
that are not ordered and may or may not have gaps between them. An unordered
contig, which is represented by an external accession number, is broken down and
used in building the sequence assembly as a normal contig.

Sequence (Genetic_Accession): A stretch of nucleotide
sequence data. This data is represented by a unique accession number and a version
number. Sequence data can include YACs, BACs, gene sequences and ESTs.
Typically, the source of sequence data will be GenBank and other sequence
databases, but any piece of sequence is allowed. A sequence is normally referenced
to its contig.

Gap: The gap is a zero length feature which indicates that
there is an unknown amount of additional sequence to be inserted at this point. It is
merely an indication of lack of knowledge and has no physical counterpart. Gaps
are usually referenced to the Assembly in which they separate the contigs. They
would also be used with the genome as reference to separate the chromosomes.

Gene: This defines the gene locus in terms of base pairs.
The start and stop positions of the gene are not usually well defined. A gene starts
somewhere between the end of the previous gene and the beginning of the first
recognized promoter element. A gene ends somewhere between the end of the last
exon and the beginning of the next gene. In practice, including at least four kilobase
pairs of promoter region is desirable. A gene is preferably referenced to an
assembly.

Gene Region: A particular region of the gene. Gene regions
are classified according to their transcriptional or translational roles. For a gene
sequence, there are promoters, introns and exons. In a transcribed sequence,
different gene regions include 5' and 3' untranslated regions (UTRs) as well as
protein-coding regions.

Polymorphism: A part of the genome that is polymorphic
across different individuals in a population. The most common polymorphisms are
SNPs, the length of which is one base pair. All polymorphisms are preferably
referenced to the sequence with respect to which they were found.

Primer: A short region of about 20 base pairs corresponding
to an oligonucleotide for priming PCR reactions and/or primer extension reactions
in a variety of polymorphism detection assays. Primers are preferably referenced to
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the sequence they were designed from. 

Transcript: The result of a splice operation of the gene
sequence. There can be several transcripts per gene, to indicate splice variants. The
transcript is mapped to genetic features via the Splice table, but does not map to
anything the conventional way, i.e., its reference is always null. The transcript starts
another branch of positional mapping of genetic features related to protein
sequences.

While the above definitions set forth the preferred reference for
certain kinds of genetic features (for example, polymorphisms should be referenced
to sequences), it is important to realize that the schema design allows the reference
for any particular genetic feature to be flexible and the reference may be changed as
circumstances warrant. Whenever the user asks for a start or stop position, he
should ask "what is the position of X relative to Y", rather than "what is the
position of X", which is an ambiguous question. The correct question can be
answered with a simple tree traversal routine. The answer will not depend on which
genetic feature serves as the direct reference for X.
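The tree traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the routine used in the database: the Feature class, the position_relative_to function and the in-memory dictionary are hypothetical names, the three rows are taken from Tables A and B of this description, and only forward-oriented features whose reference chain passes through Y are handled.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Feature:
    feature_id: int
    name: str
    ref_id: Optional[int]  # self-pointing reference; None for a root feature
    start: Optional[int]   # 1-based start, relative to the reference
    stop: Optional[int]    # one more than the position of the last base


# Hypothetical in-memory copy of a few Genetic_Feature rows (Tables A and B).
FEATURES: Dict[int, Feature] = {
    f.feature_id: f
    for f in [
        Feature(1, "Assembly", None, None, None),
        Feature(15, "EXAMPLE", 1, 120, 800),
        Feature(17, "Exon 1", 15, 180, 280),
    ]
}


def position_relative_to(
    x_id: int, y_id: int, features: Dict[int, Feature] = FEATURES
) -> Tuple[int, int]:
    """Return (start, stop) of feature X in the coordinates of ancestor Y."""
    x = features[x_id]
    start, stop, ref_id = x.start, x.stop, x.ref_id
    while ref_id != y_id:
        # Reaching a root feature without meeting Y means Y is not an ancestor.
        if ref_id is None or features[ref_id].start is None:
            raise ValueError("Y is not on the reference chain of X")
        parent = features[ref_id]
        # Position p in the parent is position parent.start + p - 1 one level up.
        start += parent.start - 1
        stop += parent.start - 1
        ref_id = parent.ref_id
    return start, stop
```

Because each step only adds the offset of the intermediate reference, the result is the same whichever feature serves as the direct reference for X, provided the stored offsets are consistent.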

All start and stop positions are preferably given in nucleotide
positions, even for protein features. This retains the uniformity of the mapping
scheme, and the translation to amino acid positions is trivial. The first position in a
sequence has the position 1. The stop position is one more than the position of the
last base, such that length = abs(stop - start). The stop position can be less than
the start position, in which case a reverse complement needs to be taken on the
reference sequence to get the feature sequence. However, in another embodiment, a
different physical map could be generated that would be expressed in something
other than base pair positions, e.g. centimorgans.
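The position convention above can be illustrated with a short sketch. The function names are hypothetical, and one point is an assumption rather than something stated in the text: a reversed feature (stop less than start) is taken to cover the same bases as the forward feature with the two positions swapped, read as a reverse complement.

```python
# Translation table for complementing nucleotides (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def feature_length(start: int, stop: int) -> int:
    """Length of a feature: abs(stop - start), as defined in the text."""
    return abs(stop - start)


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def feature_sequence(reference: str, start: int, stop: int) -> str:
    """Extract a feature from its reference sequence.

    Positions are 1-based and the stop position is one more than the
    position of the last base.
    """
    if start <= stop:
        return reference[start - 1:stop - 1]
    # Assumption: the reversed feature spans the bases of (stop, start).
    return reverse_complement(reference[stop - 1:start - 1])
```

For example, with the reference "ACGTTGCA", a feature with start 2 and stop 5 has length 3 and sequence "CGT"; swapping start and stop gives the reverse complement "ACG".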

Another level of hierarchy could be added to the genomic repository
submodel by implementing each gene region type as its own subclass extending the
Gene_Region (i.e., creating separate tables for different gene region types with the
primary key linked as foreign key to the Gene_Region table). Alternatively, the
hierarchy could be flattened by eliminating the Gene_Region object and having
individual gene region types directly subclass Genetic_Feature.
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In addition, other genetic features may be added as the database 
develops. For example, it is contemplated that an additional useful genetic feature
is a secondary structure region of a protein, e.g., alpha-helix, beta-sheet, turn and
coil regions. For each new genetic feature, a new genetic feature type needs to be
created, and a table to contain information specific to the new genetic feature type
needs to be added. Some genetic features will not have additional information
(Gap, for example), and thus no table is necessary in such cases. The primary key
of the genetic feature type specific table always needs to double as a foreign key to
the Genetic_Feature table. This design enables the database submodel to be flexible
and extendable enough to accommodate the rapid evolution and increase in volume
of genomic information.
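The rule that the primary key of a type-specific table doubles as a foreign key to Genetic_Feature can be sketched with an in-memory SQLite database. This is an illustration only: the description refers to an Oracle database, and the column names Start_Pos, Stop_Pos and Gene_Name, as well as the helper try_insert_orphan_gene, are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE Genetic_Feature (
        Feature_ID   INTEGER PRIMARY KEY,
        Feature_Type TEXT NOT NULL,
        Ref_ID       INTEGER REFERENCES Genetic_Feature(Feature_ID),
        Start_Pos    INTEGER,
        Stop_Pos     INTEGER
    );
    -- The primary key of the type-specific table is also a foreign key.
    CREATE TABLE Gene (
        Gene_ID   INTEGER PRIMARY KEY REFERENCES Genetic_Feature(Feature_ID),
        Gene_Name TEXT
    );
    """
)

# Parent records must exist before the child record can be created.
conn.execute("INSERT INTO Genetic_Feature VALUES (1, 'Assembly', NULL, NULL, NULL)")
conn.execute("INSERT INTO Genetic_Feature VALUES (15, 'Gene', 1, 120, 800)")
conn.execute("INSERT INTO Gene VALUES (15, 'EXAMPLE')")


def try_insert_orphan_gene(connection: sqlite3.Connection) -> bool:
    """Try to insert a Gene row that has no Genetic_Feature parent."""
    try:
        connection.execute("INSERT INTO Gene VALUES (99, 'ORPHAN')")
    except sqlite3.IntegrityError:
        return False
    return True
```

A feature type without additional information, such as Gap, would simply have rows in Genetic_Feature and no table of its own.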

Assembly of a genomic sequence typically starts with a gene name 
and comprises performance of the following steps by a human and/or computer 
operator: 

(a) Identify sequences related to this gene by searching GenBank 
and/or other sequence databases. 

(b) Generate contigs and alignments from the identified
sequences using a commercial sequence alignment program such as Phrap.

(c) Store the assembly, contigs, and sequences as selected by the
operator in the database (see Table A).

The result of this process is one assembly made up of one or more
contigs, which in turn are made up of potentially many sequences. This is
illustrated in the diagram shown in Figure 47 and Table A below.

Table A 



Feature Id   Feature Name   Feature Type   Reference   Start   Stop
1            Assembly       Assembly
2            Contig 1       Contig                     1       400
3            Gap 1          Gap                        400     400
4            Contig 2       Contig                     400     750
5            Gap 2          Gap                        750     750
6            Contig 3       Contig                     750     1000
7            A2345          Sequence       2           1       250
8            A3724          Sequence       2           30      180
9            M28384         Sequence       2           100     350
10           EST283729      Sequence       2           300     400
11           A2445          Sequence       4           1       250
12           M24783         Sequence       4           200     350
13           M9485          Sequence       6           1       250
14           EST374886      Sequence       6           80      220



If there is more than one contig, the assembly will be disjoint, indicating that an 
unknown amount of sequence is missing in one or more places. Each such place is 
marked by a gap feature, which is referenced to the assembly feature. 
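The placement of gap features between contigs can be sketched as follows. The function name and the tuple layout are hypothetical; the contig positions are those of Table A, and the sketch assumes that every pair of neighboring contigs is separated by one zero-length gap located at the stop position of the preceding contig, which is the pattern Table A shows.

```python
from typing import List, Tuple

# (feature name, start, stop), all positions relative to the assembly
FeatureRow = Tuple[str, int, int]


def gap_features(contigs: List[FeatureRow]) -> List[FeatureRow]:
    """Return one zero-length gap feature between each pair of neighboring contigs."""
    ordered = sorted(contigs, key=lambda row: row[1])
    gaps: List[FeatureRow] = []
    for number, (_, _, stop) in enumerate(ordered[:-1], start=1):
        # A gap has no length: its start and stop positions coincide.
        gaps.append((f"Gap {number}", stop, stop))
    return gaps


TABLE_A_CONTIGS: List[FeatureRow] = [
    ("Contig 1", 1, 400),
    ("Contig 2", 400, 750),
    ("Contig 3", 750, 1000),
]
```

Applied to TABLE_A_CONTIGS, the function yields Gap 1 at position 400 and Gap 2 at position 750, matching rows 3 and 5 of Table A. An assembly with a single contig yields no gaps.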

The assembly may be used in conjunction with additional 
information on the location of gene regions, i.e., promoters, exons and introns and 
the like, to generate a gene structure. Information on gene regions may be private or 
found in the public domain. Preferably, information on the gene regions is stored in 
the database and the gene structure is displayed to the user. An example of how 
such a display would typically appear is shown in Figure 48. The corresponding 
additions to Table A are shown in Table B below. 



Table B 



Feature Id   Feature Name   Feature Type   Reference   Start   Stop
15           EXAMPLE        Gene           1           120     800
16           Promoter       Gene Region    15          1       180
17           Exon 1         Gene Region    15          180     280
18           Intron 1       Gene Region    15          280     500
19           Exon 2         Gene Region    15          500     680
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The genomic repository database submodel of the present invention 
also allows referencing of gene transcripts to other genetic features. The 
relationship between a transcript and a genomic sequence is not a simple start/stop 
mapping, but requires the concatenation of separate regions of the genomic 
sequence into one combined sequence, the gene transcript. In the present submodel, 
this is represented by a Splice table, which provides an ordered list of splice 
elements (usually exon regions) for each splice product (usually a transcript). 
Although the splice product is a feature, it is not mapped to anything else, i.e. it is 
the root of its own mapping tree. Components of this tree can be 5' and 3' UTRs, a 
protein, and features related to that protein such as secondary structure or signal 
sequences. The diagram in Figure 49 shows the full mapping example down to the 
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protein regions. The Splice table for this example is set forth in Table C below, 
which incorporates the EXAMPLE information from Table B: 



Table C 



Splice Id   Order No   Region Id   Product Id
1           1          17          20
1           2          19          20


Also, Table A would have the following additions: 



Feature Id   Feature Name    Feature Type   Reference   Start   Stop
20           EXAMPLE trans   Transcript
21           5'UTR           Region         20          1       40
22           CETP prot       Protein        20          40      240
23           3'UTR           Region         20          240     280
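The concatenation represented by the Splice table can be sketched as follows. The function and variable names are hypothetical, the region positions are those of Table B, the splice rows are those of Table C, and the gene sequence used in the check is a synthetic placeholder rather than real sequence data.

```python
from typing import Dict, List, Tuple

# Region Id -> (start, stop) relative to the gene, from Table B
REGIONS: Dict[int, Tuple[int, int]] = {17: (180, 280), 19: (500, 680)}

# (Splice Id, Order No, Region Id, Product Id), from Table C
SPLICE_ROWS: List[Tuple[int, int, int, int]] = [(1, 1, 17, 20), (1, 2, 19, 20)]


def build_splice_product(
    gene_seq: str,
    regions: Dict[int, Tuple[int, int]],
    splice_rows: List[Tuple[int, int, int, int]],
    product_id: int,
) -> str:
    """Concatenate the splice elements of one product in Order No order."""
    elements = sorted(
        (row for row in splice_rows if row[3] == product_id),
        key=lambda row: row[1],
    )
    pieces = []
    for _, _, region_id, _ in elements:
        start, stop = regions[region_id]
        # 1-based start; the stop position is one past the last base.
        pieces.append(gene_seq[start - 1:stop - 1])
    return "".join(pieces)


# Synthetic gene of length 680: promoter, exon 1, intron 1, exon 2, one extra base.
GENE_SEQ = "p" * 179 + "E" * 100 + "i" * 220 + "F" * 180 + "x"
```

With these inputs the product for Product Id 20 is 280 bases long (100 from Exon 1 plus 180 from Exon 2), which agrees with the stop position of the 3'UTR in the additions to Table A.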
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2. Clinical Repository (FIGURE 44B): This submodel 
encapsulates polymorphism and clinical information about subjects and reference
individuals used in clinical trials. The Subject_Hap table associates a given
haplotype (identified by the field of Hap_ID) with each patient subject having that
haplotype (identified by the field of Sub_ID (Subject ID)). Associations between
polymorphisms in a locus (including SNPs and haplotypes) and different clinical
phenotypes (such as disease association and drug response) are captured by the
Measure_ID and Measure_Result fields in the Subject_Measurement table.

3. Variation Repository (FIGURE 44C): This
submodel covers the haplotypes and the polymorphisms associated with genes and
patient cohorts used in clinical trial studies. Polymorphisms may include SNPs,
small insertions/deletions, large insertions/deletions, repeats, frame shifts and
alternative splicing. The Haplotype table has the basic fields of Hap_ID,
Hap_Locus_ID and Hap_Name that identify a unique haplotype of a given gene or
locus. A haplotype is further defined by the set of SNPs that it comprises, which are
listed in the Hap_SNP table. This association table uses data fields named Hap_ID
(haplotype ID) and Poly_ID (polymorphism ID) to allow the mapping of the
many-to-many relationship between haplotype and the polymorphism(s) that
constitute the specific haplotype. The haplotype and SNP information may be used
in clinical trial and drug assay studies. Data from such studies are stored in the
clinical repository
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and drug repository submodels. 

4. Literature Repository (FIGURE 44D): This
submodel enables annotation of the genetic features in the genomic repository and
the variation information in the variation repository with public domain information
relating to these objects. Annotation information useful in the invention may be
found in peer-reviewed scientific publications, patent documents, or by searching
on-line electronic databases. The annotated objects and their referencing
information are linked through the various association tables.

5. Drug Repository (FIGURE 44E): This submodel
captures client companies, contact information, compounds used in different disease
areas and assay results for such compounds in regards to polymorphisms and
haplotypes of target genes. Associations between polymorphisms in a drug target
and activity of a candidate drug are captured by the following data fields: Hap_ID
(Hap_Locus table); Compound_ID (Compound table), and the Assay_ID (Assay,
Assay_Experiment, and Assay_Result tables).

b. Abbreviations

The following abbreviations are used extensively in the data model
described herein below, both in the table schema and in the diagram drawings
shown in FIGURES 44A-E.

• AA: amino acid
• Clin: clinical
• Descr: description
• FK: foreign key
• Geo: geographical
• HAP: haplotype
• ID: identifier
• Info: information
• Loc: location
• Med: medical
• Mol: molecule
• NT: nucleotide
• PK: primary key
• Poly: polymorphism
• Pos: position
• Pub: publication
• QC: quality control
• Seq: sequence
• SNP: single nucleotide polymorphism
• Sub: subject
• Therap: therapeutic

c. Tables

This preferred embodiment of a database of the present
invention contains 83 tables as follows:

1) Alignment_Component
2) Allele
3) Assay
4) Assay_Experiment
5) Assay_Result
6) Assembly_Component
7) Chromosome
8) Clasper_Clone
9) Class_System
10) Client_Genes
11) Clinical_Site
12) Clinical_Trial
13) Cohort
14) Company
15) Company_Address
16) Compound
17) Contact
18) Contig
19) Discovery_Method
20) Disease_Susceptibility
21) Drug
22) Drug_Target
23) Electronic_Material
24) Family
25) Feature_Info
26) Feature_Literature
27) Gene
28) Gene_Alias
29) Gene_Class
30) Gene_Hap_Locus
31) Gene_Map_Location
32) Gene_Nomenclature
33) Gene_Pathway
34) Gene_Region
35) Gene_Transcript
36) Genetic_Accession
37) Genetic_Feature
38) Genome_Map
39) Genomic_Region
40) Geo_Ethnicity
41) Hap_Allele
42) Hap_Confirmation
43) Hap_Locus
44) Hap_Locus_Poly
45) Hap_Locus_Subject
46) Haplotype
47) Ind_Geo_Ethnicity
48) Ind_Medical_History
49) Individual
50) Literature
51) Locus_Accession
52) Med_Thesaurus
53) Patent
54) Patent_Full_Text
55) Pathway
56) Pathway_Literature
57) Poly_Confirmation
58) Poly_Patent
59) Poly_Pub
60) Polymorphism
61) Project
62) Project_Gene
63) Protein
64) Publication
65) Seq_Accession
66) Seq_Assembly
67) Seq_Text
68) Species
69) Splice
70) Subject
71) Subject_Cohort
72) Subject_Hap
73) Subject_Measurement
74) Subject_Poly
75) Therap_Drug
76) Therapeutic_Area
77) Therapeutic_Gene
78) Transcript_Region
79) Trial_Cohort
80) Trial_Drug
81) Trial_Measurement
82) Unordered_Contig
83) URL

d. Fields 

Figures 44A-E show the fields of each of the tables in the
currently used database. The following are descriptions of the fields in the database:



Table Name: Alignment_Component

Field Name        PK    FK    Comments
Descr             No    No    free note text about the record; occurs in all tables
Weight            No    No    weight for a component to take in alignment decision making
Alignment_End     No    No    end of the align of component in the contig
Alignment_Start   No    No    start of the align of component in the contig
Segment_List      No    No    the actual consensus alignment text with gaps
Component_ID      No    Yes   component used in the alignment
Order_Num         Yes   No    order of the component in the alignment
Contig_ID         Yes   Yes   contig constructed by the alignment

Relationship Explanation:
An Alignment_Component is associated with exactly one Contig.
An Alignment_Component is associated with exactly one Genetic_Feature.

Table Name: Allele

Field Name        PK    FK    Comments
Descr             No    No
AA_Seq_Text       No    No    amino acid sequence for the allele
Codon_Seq_Text    No    No    codon sequence
NT_Seq_Text       No    No    nucleotide sequence
Allele_Name       No    No    descriptive name
Poly_ID           Yes   Yes   id of the polymorphism
Allele_Code       Yes   No    name that reveals the allele, usually the same as NT_Seq_Text

Relationship Explanation:
A Hap_Allele is associated with one to many Allele.
A Subject_Poly is associated with exactly one Allele.
An Allele is associated with exactly one Polymorphism.

Table Name: Assay

Field Name        PK    FK    Comments
Descr             No    No
Assay_Type        No    No
Assay_ID          Yes   No    id for an assay
Assay_Name        No    No    descriptive name

Relationship Explanation:
An Assay_Experiment is associated with exactly one Assay.

Table Name: Assay_Experiment

Field Name        PK    FK    Comments
Descr             No    No
Exp_Date          No    No    date of experiment
Operator          No    No
Exp_Parameters    No    No    parameters used in the experiment
Assay_ID          No    Yes   the assay where the experiment belongs
Exp_ID            Yes   No    id for an experiment

Relationship Explanation:
An Assay_Result is associated with exactly one Assay_Experiment.
An Assay_Experiment is associated with exactly one Assay.

Table Name: Assay_Result

Field Name        PK    FK    Comments
Descr             No    No
QC                No    No    quality control of the experiment
Assay_Result      No    No    free text of the assay result
Hap_ID            Yes   Yes   HAP in study
Protein_ID        Yes   Yes   protein in study
Compound_ID       Yes   Yes   compound in study
Exp_ID            Yes   Yes   the experiment
Clone_ID          Yes   Yes   clone involved

Relationship Explanation:
An Assay_Result is associated with exactly one Clasper_Clone.
An Assay_Result is associated with exactly one Assay_Experiment.
An Assay_Result is associated with exactly one Compound.
An Assay_Result is associated with exactly one Protein.



10 



Assembly_ Component_n> No Yes component used in the assembly 
Component 

Descr No No 

Order^Num Yes No order of the component in the assembly 



15 



20 



25 



30 



Class_ 
System 



35 



Assembly_ID Yes Yes id for tiie assembly 



An Assembly_Component is 

associated with exactly one 

Seq_Assembly. 

An AssembIy_Component is 

associated with zero or one 

Genetic Feature. 



No No 



Chromo- Descr 
some 

Chromosome^ No No descriptive name 
Name 

Species_ID No Yes the species of the genome 



Chromosome^ Yes Yes id for a chromosome 
ID 



A Gene_Map__Location is 

associated with exactly one 

Chromosome. 

A Gene_Nomenclature is 

associated with zero or one 

Chromosome. 

A Chromosome is associated 

with exactly one 

Genctic_Feature. 

A Chromosome is associated 

with zero or one Species. 



Clasper_ CloncJ© 
Clone 

HapJD 
Descr 
Sub ID 



Yes No id for a clone 

Yes Yes HAP the clone represents 
No No 

No Yes the individual fi^om which the clone is 
obtained 



Path_Name No No the specific path a class is defined 

Descr No No 

Class_Name No No descriptive name 

Node__Level No No level at which the class is located 

Super_ID No No theparentof the current class 

Class ID Yes No id for a class 



An Assay_Result is 
associate with exactly one 
Clasper_Clone. 
A Clasper_Clone is 
associated with zero or one 
Subjects. 

A Clasper_Clone is 
associated with exactly one 
Haplotype. 



A Gene_C]ass is associated 
with exactly one 
Class_System. 
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10 



15 



20 





Class^System 


No No the system used to define the class 




Client_ 


Request.Details No No details of the request 




Genes 


Security_Code 
Descr 

Request_Order 


No No security level of the request 
No No 

No No the physical order of the request 






Conripany_ID 


Yes Yes id for company that makes the request 


A Client_Genes is associated 
with exactly one Gene. 




Genc^ID 


Yes Yes id of the gene 


A Client_Genes is associated 
with exactly one Company. 


Clinical^ 


Descr 


No No 




Site 


Company ID 
)9iie_iNaine 


No Yes 

No No descriptive name 






ClinicaI_Site_ 


Yes No A Clinical_Site R/4] at least one Subject. 


J\. i3Uu|C%/l l9 dooUli'lalcU Willi 




ID 




A Clinical_Site is associated 

until fynf^flv ntif^ C*t\mrsan\/ 
Willi cAiiuiijr viiv v»uiii|7<uiy. 


Clinical_Trial Descr No No

Therap_ID No Yes id for the therapeutic area

Start_Date No No when the trial started

Trial_ID Yes No id

Trial_Code No No code for identification purpose

Trial_Name No No descriptive name

A Clinical_Trial is associated with one to many Trial_Drug.
A Clinical_Trial is associated with one to many Trial_Cohort.
A Clinical_Trial is associated with one to many Trial_Measurement.
A Trial_Drug is associated with exactly one to many Clinical_Trial.
A Trial_Cohort is associated with exactly one Clinical_Trial.
A Trial_Measurement is associated with exactly one Clinical_Trial.
A Clinical_Trial is associated with one Therapeutic_Area.






Cohort Descr No No 

Cohort_Name No No descriptive name 

Cohort_ID Yes No id

Company_ID No Yes company who owns the trial 



Company Descr No No



A Cohort is associated with
one to many Trial_Cohort.
A Cohort is associated with
one to many Subject_Cohort.
A Trial_Cohort is associated
with exactly one Cohort.
A Subject_Cohort is
associated with exactly one
Cohort.

A Cohort is associated with 
exactly one Company. 



A Compound is associated
with exactly one Company.
A Company_Address is 
associated with exactly one 
Company. 

A Clinical_Site is associated 
with exactly one Company. 
A Client_Genes is associated 
with exactly one Company. 
A Cohort is associated with 
exactly one Company. 
No No descriptive name



Company_Address Descr No No

Web_Site No No

Zip No No

Country No No

State No No

City No No

Street No No

Address_ID Yes No

Company_ID Yes Yes



Compound Compound_Name No No descriptive name



A Patent is associated with one Company.
A Drug is associated with exactly one Company.
A Company is associated with one to many Compound.
A Company is associated with one to many Company_Address.
A Company is associated with one to many Clinical_Site.
A Company is associated with one to many Client_Gene.
A Company is associated with one to many Cohort.
A Company is associated with one to many Patent.
A Company is associated with one to many Drug.
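The relationship sentences in this table follow a fixed pattern ("A X is associated with <cardinality> Y."), so they can be turned into structured cardinality constraints. The sketch below is illustrative only: the function name `parse_relationship`, the tuple layout and the set of recognized cardinality phrases are assumptions, and sentences with other wordings simply return None.

```python
import re

# Cardinality phrase -> (minimum, maximum); None means "no upper bound".
_CARDINALITY = {
    "exactly one": (1, 1),
    "zero or one": (0, 1),
    "one to many": (1, None),
    "zero to many": (0, None),
    "at least one": (1, None),
}

_PATTERN = re.compile(
    r"^An? (\w+) is associated with "
    r"(exactly one|zero or one|one to many|zero to many|at least one) "
    r"(\w+)\.?$"
)


def parse_relationship(sentence):
    """Parse one relationship sentence into (source, minimum, maximum, target)."""
    # Collapse line breaks and repeated spaces left over from column wrapping.
    normalized = " ".join(sentence.split())
    match = _PATTERN.match(normalized)
    if match is None:
        return None
    source, phrase, target = match.groups()
    minimum, maximum = _CARDINALITY[phrase]
    return (source, minimum, maximum, target)
```

For instance, "A Company is associated with one to many Drug." would parse to a Company-to-Drug constraint with a minimum of one and no upper bound.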



A Company_Address is
associated with one to many
Contact.

A Contact is associated with
zero or one
Company_Address.
A Company_Address is
associated with exactly one
Company.



Structure_Handler No No a handler for accessing the structure info

Descr No No

Company_ID No Yes company who owns the compound

Registration_Num No No registration number of the compound

Compound_ID Yes No id

Patent_ID No Yes patent on the compound



Contact Office_Phone No No
Email_Address No No
Cell_Phone No No



A Compound is associated
with one to many
Assay_Result.
A Compound is associated
with one to many Drug.
An Assay_Result is
associated with exactly one
Compound.

A Drug is associated with
zero or one Compound.
A Compound is associated
with zero or one Patent.
A Compound is associated
with exactly one Company.
FAX No No

Web_Site No No

Descr No No

Pager_Phone No No

Department No No

Contact_ID Yes No

Company_ID No Yes

Address_ID No Yes

Last_Name No No

Middle_Name No No

First_Name No No

A Contact is associated with zero or one Company_Address.






Contig Descr

[column name illegible] No No descriptive name

[column name illegible] Yes Yes id

A Contig is associated with one to many Alignment_Component.
An Alignment_Component is associated with exactly one Contig.
A Contig is associated with exactly one Genetic_Feature.


Discovery_Method Descr No No

Method_Protocol No No detailed protocol

Method_Name No No descriptive name

Method_ID Yes No id

A Discovery_Method is associated with one to many Hap_Confirmation.
A Discovery_Method is associated with one to many Poly_Confirmation.
A Hap_Confirmation is associated with zero or one Discovery_Method.
A Poly_Confirmation is associated with zero or one Discovery_Method.


Disease_Susceptibility Poly_ID No Yes polymorphism in study

Ethnic_Code Yes Yes ethnic group code

Therap_ID Yes Yes therapeutic area in study

Descr No No

Hap_ID No Yes HAP in study

Susceptibility No No measurement of susceptibility

A Disease_Susceptibility is associated with zero or one Polymorphism.
A Disease_Susceptibility is associated with exactly one Therapeutic_Area.
A Disease_Susceptibility is associated with exactly one Geo_Ethnicity.
A Disease_Susceptibility is associated with zero or one Haplotype.


Drug Compound_ID No Yes being a compound with an ID

Development_Stage No No stage

Side_Effects No No

Toxicity No No

Administration_Route No No

Descr No No

A Drug is associated with one to many Trial_Drug.
No No

Protein_ID No Yes protein ID if drug is a protein

Drug_ID Yes No id

Common_Name No No

Scientific_Name No No

Generic_Name No No

Drug_Class No No classification of the drug

Company_ID No Yes company who owns the drug



Drug_Target



Descr No No 

Gene_ID Yes Yes the gene that the drug works on 

Drug_ID Yes Yes drug in study 



A Drug is associated with
one to many Drug_Target.
A Drug is associated with
one to many Therap_Drug.
A Trial_Drug is associated
with exactly one Drug.
A Drug_Target is associated
with exactly one Drug.
A Therap_Drug is associated
with exactly one Drug.
A Drug is associated with
zero or one Protein.
A Drug is associated with
zero or one Compound.
A Drug is associated with
exactly one Company.



Electronic_Material Receive_Date No No

Descr No No 

Title No No 

Contents No No 

Email_Address No No 

Info_Source No No

Info_ID Yes Yes



captures the referencing material 
distributed electronically 



A Drug_Target is associated
with exactly one Drug.
A Drug_Target is associated
with exactly one Gene.



An Electronic_Material is 
associated with exactly one 
Literature. 





Data_Type No No

Authors No No






Family Descr No No

Generation_Up No No number of generation into the ancestry

Mother No Yes

Father No Yes

Family_ID Yes No id


A Family is associated with 
exactly one Individual. 
A Family is associated with 
exactly one Individual. 
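The Family rows link an individual to a Mother and a Father, each of which refers to another Individual, with Generation_Up counting generations into the ancestry. A minimal sketch of walking such parent links is shown below. The dictionary layout, the use of None for an unknown parent and the function name `ancestors` are assumptions for illustration and are not defined in the specification.

```python
def ancestors(parents, ind_id, generations_up):
    """Collect ancestors of `ind_id` grouped by generation.

    `parents` maps an individual id -> (mother_id, father_id); a missing
    entry or a None value means the parent is not recorded (assumption).
    Returns {generation: set_of_individual_ids}, with generation 1 = parents.
    """
    result = {}
    frontier = {ind_id}
    for generation in range(1, generations_up + 1):
        next_frontier = set()
        for individual in frontier:
            for parent in parents.get(individual, (None, None)):
                if parent is not None:
                    next_frontier.add(parent)
        if not next_frontier:
            break  # no recorded ancestors further up
        result[generation] = next_frontier
        frontier = next_frontier
    return result
```

The walk stops early once no further parents are recorded, so the returned dictionary may cover fewer generations than requested.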


Feature_Info Descr No No

Detail_Value No No feature info value

Feature_Qualifier Yes No feature info category

Feature_ID Yes Yes

A Feature_Info is associated with exactly one Genetic_Feature.


Feature_Literature Descr No No feature to literature association

Literature_ID Yes Yes

Feature_ID Yes Yes

A Feature_Literature is associated with exactly one Genetic_Feature.
A Feature_Literature is associated with exactly one Literature.



Gene 



A Gene_Map_Location is 
associated with exactly one 
Gene.
Gene_Symbol No Yes standard symbol

Descr No No

Species_ID No Yes species in which the gene is located

Gene_ID Yes Yes id 




A Client_Genes is associated 
with exactly one Gene. 
A Seq_Gene_Location is 
associated with exactly one 
Gene. 

A Feature_Gene_Location is 
associated with exactly one 
Gene. 

A Therapeutic_Gene is 
associated with exactly one 
Gene. 

A Gene_Pathway is
associated with exactly one
Gene.

A Drug_Target is associated
with exactly one Gene.
A Gene_Class is associated
with exactly one Gene.
A Patent is associated with
zero or one Gene.
A Project_Gene is associated
with exactly one Gene.
A Gene_Hap_Locus is
associated with exactly one
Gene.

A Gene_Transcript is
associated with zero or one
Gene.

A Gene_Region is associated with exactly one Gene.
A Gene_Alias is associated with exactly one Gene.
A Protein is associated with exactly one Gene.
A Gene is associated with one to many Gene_Map_Location.
A Gene is associated with one to many Client_Gene.
A Gene is associated with one to many Seq_Gene_Location.
A Gene is associated with one to many Feature_Gene_Location.
A Gene is associated with one to many Therapeutic_Gene.
A Gene is associated with one to many Gene_Pathway.
A Gene is associated with one to many Drug_Target.
A Gene is associated with one to many Gene_Class.
A Gene is associated with one to many Patent.
A Gene is associated with one to many Project_Gene.
A Gene is associated with one to many Gene_Hap_Locus.
A Gene is associated with one to many
Gene_Alias Descr No No

Gene_ID No Yes

Alias_Name No No descriptive name

Gene_Alias_ID Yes No id



Gene_Class Descr No No

Class_ID Yes Yes gene classification

Gene_ID Yes Yes



Gene_Hap_Locus Descr No No HAP association to the gene

Hap_Locus_ID Yes Yes

Gene_ID Yes Yes



Gene_Map_Location Map_Location No No location of the gene in the genome

Descr No No

Chromosome_ID No Yes the chromosome

Map_ID Yes Yes id of the map



Gene_ID Yes Yes gene 



Gene_Nomenclature Chromosome_ID No Yes the standard literature for the gene

Descr No No

Cyto_Location No No cytological location of gene

Gene_Description No No

Gene_Name No No descriptive name






Gene_Transcript.
A Gene is associated with one to many Gene_Region.
A Gene is associated with one to many Gene_Alias.
A Gene is associated with one to at least one Protein.
A Gene is associated with exactly one Species.
A Gene is associated with exactly one Genetic_Feature.
A Gene is associated with exactly one Species.
A Gene is associated with exactly one Gene_Nomenclature.



A Gene_Alias is associated 
with exactly one Gene. 



A Gene_Class is associated 
with exactly one Gene. 
A Gene_Class is associated
with exactly one
Class_System.



A Gene_Hap_Locus is
associated with exactly one
Gene.

A Gene_Hap_Locus is
associated with exactly one
Hap_Locus.



A Gene_Map_Location is associated with exactly one Gene.
A Gene_Map_Location is associated with exactly one Chromosome.
A Gene_Map_Location is associated with exactly one Genome_Map.



A Gene_Nomenclature is
associated with zero or one
Gene_Nomenclature.
A Gene_Nomenclature is
associated with zero or one
Chromosome.

A Gene_Nomenclature is associated with
exactly one Gene.
Gene_Symbol Yes No standard symbol

Most_Current No No version management of the record

Locus_ID No No id



Gene_Pathway Descr No No

Gene_ID Yes Yes



Pathway_ID Yes Yes biological pathway 



Gene_Region Region_Type No No genomic region type

Region_Name No No descriptive name

Descr No No

Gene_ID No Yes gene it belongs to

Region_ID Yes Yes id



Gene_Transcript Descr No No

Transcript_Name No No descriptive name

Gene_ID No Yes gene it belongs to

Transcript_ID Yes Yes id



Genetic_Accession Mol_Type No No molecular type of the record

URL_ID No Yes the URL address on the web

Source_Name No No

Descr No No

Accession_Code No No the actual accession code



A Gene is associated with exactly one Gene_Nomenclature.



A Gene_Pathway is 
associated with exactly one 
Pathway. 

A Gene_Pathway is 
associated with exactly one 

Gene. 



A Gene_Region is associated with one to many Polymorphism.
A Polymorphism is associated with zero or one Gene_Region.
A Genomic_Region is associated with exactly one Gene_Region.
A Transcript_Region is associated with exactly one Gene_Region.
A Gene_Region is associated with one to many Genomic_Region.
A Gene_Region is associated with one to many Transcript_Region.
A Gene_Region is associated with exactly one Genetic_Feature.
A Gene_Region is associated with exactly one Gene.



A Gene_Transcript is associated with one to many Splice.
A Gene_Transcript is associated with one to many Transcript_Region.
A Splice is associated with exactly one Gene_Transcript.
A Transcript_Region is associated with exactly one Gene_Transcript.
A Gene_Transcript is associated with exactly one Genetic_Feature.
A Gene_Transcript is associated with zero or one Gene.



Seq_Version No No sequence version number

Accession_ID Yes Yes id

GI No No GI number used in GenBank

A Genetic_Accession is associated with zero or one URL.
A Genetic_Accession is associated with exactly one Genetic_Feature.



Genetic_Feature Feature_ID Yes No id

the high level abstraction of genetic objects

A Genetic_Accession is associated with exactly one Genetic_Feature.
A Protein is associated with exactly one Genetic_Feature.
A Chromosome is associated with exactly one Genetic_Feature.
A Feature_Literature is associated with exactly one Genetic_Feature.
A Polymorphism is associated with exactly one Genetic_Feature.
A Gene_Region is associated with exactly one Genetic_Feature.
A Gene is associated with exactly one Genetic_Feature.
A Seq_Feature_Location is associated with exactly one Genetic_Feature.
A Feature_Gene_Location is associated with exactly one Genetic_Feature.
A Feature_Info is associated with exactly one Genetic_Feature.
A Gene_Transcript is associated with exactly one Genetic_Feature.
A Seq_Assembly is associated with exactly one Genetic_Feature.
An Unordered_Contig is associated with zero or one Genetic_Feature.
An Unordered_Contig is associated with zero or one Genetic_Feature.
An Unordered_Contig is associated with exactly one Genetic_Feature.
A Genetic_Feature is associated with zero or one Genetic_Feature.
An Assembly_Component is associated with zero or one Genetic_Feature.
An Alignment_Component is associated with exactly one Genetic_Feature.
A Contig is associated with exactly one Genetic_Feature.
A Splice is associated with exactly one Genetic_Feature.



Most_Current No No version management of the record 



Feature_Type No No type of the feature

Ref_ID No No parent of a feature in term of positional map

Start_Pos No No start position of the feature in its parent 



End_Pos No No end 

Complement No No whether on the reverse strand 
Descr No No 
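The positional columns above (Ref_ID, Start_Pos, End_Pos, Complement) locate a feature inside its parent feature. A minimal sketch of extracting a feature's sequence from its parent's sequence follows. It assumes 1-based inclusive coordinates and that Complement means the feature reads as the reverse complement of the parent strand; neither convention is stated in the table, and the function name `feature_sequence` is hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def feature_sequence(parent_seq, start_pos, end_pos, complement=False):
    """Return the sub-sequence of `parent_seq` covered by a feature.

    Coordinates are treated as 1-based and inclusive (assumption).
    """
    if not (1 <= start_pos <= end_pos <= len(parent_seq)):
        raise ValueError("feature coordinates fall outside the parent sequence")
    sub_seq = parent_seq[start_pos - 1:end_pos]
    if complement:
        # Reverse strand: complement each base, then reverse the order.
        sub_seq = sub_seq.translate(_COMPLEMENT)[::-1]
    return sub_seq
```

Because Ref_ID points at a parent feature, the same function could be applied repeatedly to map a nested feature back to a top-level sequence.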
A Seq_Text is associated with exactly one Genetic_Feature.
A Genetic_Feature is associated with one to many Genetic_Accession.
A Genetic_Feature is associated with one to exactly 1 Protein.
A Genetic_Feature is associated with one to many Chromosome.
A Genetic_Feature is associated with one to many Feature_Literature.
A Genetic_Feature is associated with one to many Polymorphism.
A Genetic_Feature is associated with one to many Gene_Region.
A Genetic_Feature is associated with one to many Genes.
A Genetic_Feature is associated with one to at least one Seq_Feature_Location.
A Genetic_Feature is associated with exactly one to many Feature_Gene_Location.
A Genetic_Feature is associated with one to many Feature_Info.
A Genetic_Feature is associated with one to many Gene_Transcript.
A Genetic_Feature is associated with one to many Seq_Assembly.
A Genetic_Feature is associated with one to many Unordered_Contig.
A Genetic_Feature is associated with one to many Unordered_Contig.
A Genetic_Feature is associated with one to many Unordered_Contig.
A Genetic_Feature is associated with one to many Genetic_Feature.
A Genetic_Feature is associated with one to many Assembly_Component.
A Genetic_Feature is associated with one to many Alignment_Component.
A Genetic_Feature is associated with one to many Contig.
A Genetic_Feature is associated with one to many Splice.
A Genetic_Feature is associated with one to many Seq_Text.
A Genetic_Feature is associated with zero or one Genetic_Feature.



Genome_Map External_Key No No legendary key

Descr No No

Map_Type No No type of the map

Map_ID Yes No id

Map_Name No No descriptive name

Most_Current No No version management of the record

Species_ID No Yes species of the map

A Genome_Map is associated with exactly one Species.
A Genome_Map is associated with one to many Gene_Map_Location.
A Genome_Map is associated with zero or one Genome_Map.
A Gene_Map_Location is associated with exactly one Genome_Map.




Genomic_Region Descr No No gene region in terms of DNA organization

Region_ID Yes Yes id

A Genomic_Region is associated with exactly one Gene_Region.


Geo_Ethnicity Ethnic_Group No No the major ethnic group name

Descr No No

Ethnic_Name No No descriptive name

Ethnic_Code Yes No code for a specific ethnic sub-group

A Disease_Susceptibility is associated with exactly one Geo_Ethnicity.
An Ind_Geo_Ethnicity is associated with exactly one Geo_Ethnicity.
A Poly_Confirmation is associated with zero or one Geo_Ethnicity.
A Hap_Confirmation is associated with zero or one Geo_Ethnicity.
A Geo_Ethnicity is associated with one to many Disease_Susceptibility.
A Geo_Ethnicity is associated with one to many Ind_Geo_Ethnicity.
A Geo_Ethnicity is associated with one to many Poly_Confirmation.
A Geo_Ethnicity is associated with one to many Hap_Confirmation.


Hap_Allele Descr No No

Poly_ID Yes Yes polymorphism constituting the HAP

Allele_Code Yes Yes the specific allele of that polymorphism

Hap_ID Yes Yes HAP

A Hap_Allele is associated with exactly one Haplotype.
A Hap_Allele is associated with exactly one Allele.
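Each Hap_Allele row ties one haplotype (Hap_ID) to one allele (Allele_Code) of one polymorphism (Poly_ID), so a haplotype can be reassembled by grouping rows on Hap_ID. The sketch below illustrates this. The tuple layout of the rows and the function name `haplotypes_from_rows` are hypothetical, and the check that a haplotype carries a single allele per polymorphism is an added assumption rather than something the table itself enforces.

```python
from collections import defaultdict


def haplotypes_from_rows(rows):
    """Group (Hap_ID, Poly_ID, Allele_Code) rows into haplotypes.

    Returns {Hap_ID: ((Poly_ID, Allele_Code), ...)} with sites sorted by Poly_ID.
    """
    alleles_by_hap = defaultdict(dict)
    for hap_id, poly_id, allele_code in rows:
        if poly_id in alleles_by_hap[hap_id]:
            # Assumed rule: one allele per polymorphic site within a haplotype.
            raise ValueError(f"haplotype {hap_id} has two alleles at site {poly_id}")
        alleles_by_hap[hap_id][poly_id] = allele_code
    return {
        hap_id: tuple(sorted(sites.items()))
        for hap_id, sites in alleles_by_hap.items()
    }
```

Sorting by Poly_ID is only a stand-in for ordering sites along the locus; a real ordering would presumably use the positional information stored elsewhere in the schema.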


Hap_Confirmation Sample_Size No No sample size in the HAP study

External_Key No No legendary key

QC No No quality info

Descr No No

Name_Alias No No other names

Source_Name Yes No where reported

Hap_Locus_ID Yes Yes id



Ethnic_Code No Yes sub-group of population 



Method_ID No Yes method used in discovery 



Hap_Locus

the HAP built on a locus region

Descr No No

Hap_Locus_Name No No descriptive name

Most_Current No No version management of the record

Hap_Locus_ID Yes No id



Hap_Locus_Poly Descr No No HAP to SNP association

Poly_ID Yes Yes

Hap_Locus_ID Yes Yes



A Hap_Confirmation is associated with zero or one Geo_Ethnicity.
A Hap_Confirmation is associated with exactly one Hap_Locus.
A Hap_Confirmation is associated with zero or one Discovery_Method.



A Haplotype is associated with exactly one Hap_Locus.
A Hap_Locus_Poly is associated with exactly one Hap_Locus.
A Gene_Hap_Locus is associated with exactly one Hap_Locus.
A Hap_Locus_Subject is associated with exactly one Hap_Locus.
A Hap_Locus is associated with zero or one Hap_Locus.
A Subject_Hap is associated with exactly one Hap_Locus.
A Hap_Confirmation is associated with exactly one Hap_Locus.
A Hap_Locus is associated with zero or one Hap_Locus.
A Hap_Locus is associated with one to many Haplotype.
A Hap_Locus is associated with one to many Hap_Locus_Poly.
A Hap_Locus is associated with one to many Gene_Hap_Locus.
A Hap_Locus is associated with one to many Hap_Locus_Subject.
A Hap_Locus is associated with one to many Hap_Locus.
A Hap_Locus is associated with one to many Subject_Hap.
A Hap_Locus is associated with one to many Hap_Confirmation.

A Hap_Locus_Poly is associated with exactly one Hap_Locus.
A Hap_Locus_Poly is associated with exactly one Polymorphism.
Hap_Locus_Subject Hap_Locus_ID Yes Yes HAP to subject association

Descr No No

Sub_ID Yes Yes

Haplotype Descr No No

Hap_Name No No descriptive name

Hap_Locus_ID No Yes HAP locus to which this HAP belongs



A Hap_Locus_Subject is
associated with exactly one
Hap_Locus.

A Hap_Locus_Subject is
associated with exactly one
Subject.



Hap_ID Yes No id



Ind_Geo_Ethnicity Ethnic_Code Yes Yes individual's ethnic background

Ind_ID Yes Yes

Descr No No 



Genetic_Weight No No the weight of different ethnic heritage 



A Subject_Hap is associated with exactly one Haplotype.
A Hap_Allele is associated with exactly one Haplotype.
A Disease_Susceptibility is associated with zero or one Haplotype.
A Clasper_Clone is associated with exactly one Haplotype.
A Haplotype is associated with one to many Subject_Hap.
A Haplotype is associated with one to many Hap_Allele.
A Haplotype is associated with one to many Disease_Susceptibility.
A Haplotype is associated with one to many Clasper_Clone.
A Haplotype is associated with exactly one Hap_Locus.



An Ind_Geo_Ethnicity is associated with exactly one Individual.
An Ind_Geo_Ethnicity is associated with exactly one Geo_Ethnicity.

Ind_Medical_History Descr No No Medical history for an individual

Ind_ID Yes Yes

Therap_ID Yes Yes



Individual Descr No No individual info

YOB No No year of birth

Gender No No

Mother No No

Father No No

Species_ID No Yes possible for cross species study

Ind_Type No No

Ind_Code No No





An Ind_Medical_History is associated with exactly one Therapeutic_Area.
An Ind_Medical_History is associated with exactly one Individual.



An Ind_Geo_Ethnicity is 
associated with exactly one 
Individual. 

A Family is associated with 
exactly one Individual. 
A Family is associated with 
exactly one Individual. 
Ind_ID Yes No id

An Ind_Medical_History is
associated with exactly one
Individual.

A Subject is associated with 
exactly one Individual. 
An Individual is associated
with one to many
Ind_Geo_Ethnicity.
An Individual is associated 
with one to zero or one 
Family. 

An Individual is associated 
with zero to many 
Ind_Medical_History. 
An Individual is associated 
with zero to one Subject. 
An Individual is associated 
with exactly one Species. 



Literature Descr No No

Image_File No No the large multimedia file for the record

Source_Name No No
Literature_Type No No

Literature_ID Yes No id

URL_ID No Yes URL address on the web



A Patent is associated with exactly one Literature.
A Publication is associated with exactly one Literature.
An Electronic_Material is associated with exactly one Literature.
A Feature_Literature is associated with exactly one Literature.
A Pathway_Literature is associated with exactly one Literature.
A Literature is associated with zero or one URL.
A Literature is associated with zero to many Patent.
A Literature is associated with zero to many Publication.
A Literature is associated with zero to many Electronic_Material.
A Literature is associated with zero to many Feature_Literature.
A Literature is associated with zero to many Pathway_Literature.



Locus_Accession Accession_Type No No the molecule type for the sequence

Descr No No 

Locus_ID Yes No NCBI locus id 

Accession No No the actual accession code 



Med_Thesaurus Data_Source No No medical terminology

External_Key No No

Descr No No

Term_ID Yes No

Definition No No

URL_ID No Yes

A Med_Thesaurus is associated with zero or one URL.
Medical_Term No No




Patent Institution No No patent info

Year No No

Title No No

No No

Granted_By No No

Descr No No

Patent_Claims No No

Inventors No No

Patent_ID Yes Yes

Gene_ID No Yes

Patent_Num No No

Company_ID No Yes

Patent_Type No No could be pending, approved, etc.

A Patent is associated with zero to many Patent_Full_Text.
A Patent is associated with zero to many Compound.
A Patent is associated with zero to many Poly_Patent.
A Patent is associated with zero or one Gene.
A Patent is associated with zero or one Company.
A Patent is associated with exactly one Literature.
A Patent_Full_Text is associated with exactly one Patent.
A Compound is associated with zero or one Patent.
A Poly_Patent is associated with exactly one Patent.




Patent_Full_Text Descr No No

Full_Text No No the full text document

Patent_ID Yes Yes

A Patent_Full_Text is associated with exactly one Patent.


Pathway Pathway_Name No No biological pathway info

Pathway_ID Yes No

Descr No No

A Gene_Pathway is associated with exactly one Pathway.
A Pathway_Literature is associated with exactly one Pathway.
A Pathway is associated with one to many Gene_Pathway.
A Pathway is associated with one to many Pathway_Literature.


Pathway_Literature Descr pathway literature association

Pathway_ID Yes Yes

Literature_ID Yes Yes

A Pathway_Literature is associated with exactly one Literature.
A Pathway_Literature is associated with exactly one Pathway.


Poly_Confirmation Method_ID No Yes polymorphism confirmation info

Source_Name Yes No which data source

Name_Alias No No alias name

Poly_ID Yes Yes id

Descr No No

QC No No quality control info

External_Key No No legendary key

Sample_Size No No size of sample in discovery

Ethnic_Code No Yes ethnic group info

A Poly_Confirmation is associated with exactly one Polymorphism.



Poly_Patent Descr No No polymorphism patent association

Poly_ID Yes Yes

Patent_ID Yes Yes



A Poly_Confirmation is
associated with zero or one
Discovery_Method.
A Poly_Confirmation is
associated with zero or one
Geo_Ethnicity.



Poly_Pub Descr No No polymorphism publication association

Pub_ID Yes Yes

Poly_ID Yes Yes



A Poly_Patent is associated 
with exactly one Patent. 
A Poly_Patent is associated 
with exactly one 
Polymorphism. 



A Poly_Pub is associated with exactly one Publication.
A Poly_Pub is associated with exactly one Polymorphism.

Polymorphism Mol_Consequence No No molecular mechanism of the polymorphism

Primer_Pair_ID No No primer used in the discovery

3Flank_Seq_Text No No flanking sequence on 3' end

5Flank_Seq_Text No No flanking sequence on 5' end

Descr No No

Region_ID No Yes the region where the polymorphism locates

Poly_Length No No length of the variation

Poly_ID Yes Yes id

Variation_Type No No type of variation

System_Name No No systematic name of the polymorphism

A Subject_Poly is associated with exactly one Polymorphism.
A Poly_Pub is associated with exactly one Polymorphism.
A Polymorphism is associated with one to many Subject_Poly.
A Polymorphism is associated with one to many Poly_Pub.
A Polymorphism is associated with exactly one Genetic_Feature.
A Disease_Susceptibility is associated with zero or one Polymorphism.
A Poly_Patent is associated with exactly one Polymorphism.
A Hap_Locus_Poly is associated with exactly one Polymorphism.
An Allele is associated with exactly one Polymorphism.
A Poly_Confirmation is associated with exactly one Polymorphism.
A Polymorphism is associated with zero to many Disease_Susceptibility.
A Polymorphism is associated with zero to many Poly_Patent.
A Polymorphism is associated with many Hap_Locus_Poly.
A Polymorphism is associated with at least one Allele.
A Polymorphism is associated with at least one Poly_Confirmation.
Structure_Handler No No protein structure info handler

Gene_ID No Yes gene it belongs to

Protein_ID Yes Yes id
Seq_ID Yes Yes id



Species Alias_Name No No other names



A Polymorphism is
associated with zero or one 



Project Descr No No project info

Submitter No No

Project_Manager No No

Project_Name No No

Project_ID Yes No

A Project is associated with one to many Project_Gene.
A Project_Gene is associated with exactly one Project.


Project_Gene Descr No No project gene association

Gene_ID Yes Yes

Project_ID Yes Yes

A Project_Gene is associated with exactly one Project.
A Project_Gene is associated with exactly one Gene.


Protein Descr No No

A Protein is associated with zero to many Drug.
A Protein is associated with zero to many Assay_Result.
A Drug is associated with zero or one Protein.
An Assay_Result is associated with exactly one Protein.
A Protein is associated with exactly one Gene.
A Protein is associated with



Publication Keywords No No

Abstract No No

Descr No No

Title No No

Institution No No

Year No No

Pub_ID Yes Yes

Authors No No

Journal No No

A Publication is associated with zero to many Poly_Pub.
A Publication is associated with exactly one Literature.
A Poly_Pub is associated with exactly one Publication.




Seq_Assembly Assembly_Name No No the consensus sequence built from alignment

Descr No No

Assembly_ID Yes Yes id

A Seq_Assembly is associated with one to many Assembly_Component.
A Seq_Assembly is associated with exactly one Genetic_Feature.
An Assembly_Component is associated with exactly one Seq_Assembly.


Seq_Text Descr No No

Seq_Text No No the actual sequence text

A Seq_Text is associated with exactly one Genetic_Feature.
Species_ID Yes No id
Descr No No

System_Name No No systematic name of the species

Common_Name No No common name



A Gene is associated with 
exactly one Species. 
A Genome_Map is 
associated with exactly one 
Species. 

A Gene is associated with 
exactly one Species. 
A Chromosome is associated 
with zero or one Species. 
An Individual is associated
with exactly one Species.
A Species is associated with
one to many Gene.
A Species is associated with
zero to many Genome_Map.
A Species is associated with 
one to many Gene. 
A Species is associated with 
one to many Chromosome. 
A Species is associated with 
one to many Individual. 



Splice
    Component_ID    No   Yes  component involved in the splicing
    Descr           No   No
    Order_Num       Yes  No   order of the component in the splicing product
    Transcript_ID   Yes  Yes  id for the transcript

    A Splice is associated with exactly one Gene_Transcript.
    A Splice is associated with exactly one Genetic_Feature.

Subject
    Descr           No   No   this is a subset of individual
    External_Key    No   No
    Clinical_Site_ID No  Yes  collection site
    Sub_ID          Yes  Yes  id

    A Clasper_Clone is associated with zero or one Subject.
    A Subject_Poly is associated with exactly one Subject.
    A Subject_Hap is associated with exactly one Subject.
    A Subject_Cohort is associated with exactly one Subject.
    A Subject_Measurement is associated with exactly one Subject.
    A Hap_Locus_Subject is associated with exactly one Subject.
    A Subject is associated with zero to many Clasper_Clone.
    A Subject is associated with zero to many Subject_Poly.
    A Subject is associated with zero to many Subject_Hap.
    A Subject is associated with zero to many Subject_Cohort.
    A Subject is associated with zero to many Subject_Measurement.
    A Subject is associated with zero to many Hap_Locus_Subject.
    A Subject is associated with exactly one Clinical_Site.
    A Subject is associated with exactly one Individual.




Subject_Cohort
    Cohort_ID       Yes  Yes  cohort subject association
    Descr           No   No
    [illegible]     Yes  Yes

    A Subject_Cohort is associated with exactly one Subject.
    A Subject_Cohort is associated with exactly one Cohort.




Subject_Hap
    Hap_Locus_ID    Yes  Yes  subject HAP typing info
    Copy_Num        Yes  No   identify the copy of the HAP
    QC              No   No   quality control data
    Descr           No   No
    Hap_ID          No   Yes  id of HAP
    Sub_ID          Yes  Yes  id of subject

    A Subject_Hap is associated with exactly one Haplotype.
    A Subject_Hap is associated with exactly one Subject.
    A Subject_Hap is associated with exactly one Hap_Locus.
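
As an illustration of how the Subject and Subject_Hap entries might be realized, the following minimal sketch builds both tables in an in-memory SQLite database from Python. It assumes that the two Yes/No columns of the listing flag primary-key and foreign-key membership, which the listing itself does not state; the column types and the sample values are likewise hypothetical.

```python
import sqlite3

def build_demo_db():
    """Create Subject and Subject_Hap in memory (column types are assumed)."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # enforce "exactly one Subject"
    db.executescript("""
        CREATE TABLE Subject (
            Sub_ID        INTEGER PRIMARY KEY,
            Descr         TEXT,
            External_Key  TEXT
        );
        -- Hap_ID and Hap_Locus_ID would reference Haplotype and Hap_Locus
        -- tables, which are omitted from this sketch.
        CREATE TABLE Subject_Hap (
            Sub_ID        INTEGER NOT NULL REFERENCES Subject(Sub_ID),
            Hap_Locus_ID  INTEGER NOT NULL,
            Copy_Num      INTEGER NOT NULL,
            Hap_ID        INTEGER NOT NULL,
            QC            TEXT,
            Descr         TEXT,
            PRIMARY KEY (Sub_ID, Hap_Locus_ID, Copy_Num)
        );
    """)
    return db

db = build_demo_db()
db.execute("INSERT INTO Subject (Sub_ID, Descr) VALUES (1, 'demo subject')")
# Two copies (one per chromosome) of the haplotype at a single locus.
db.executemany(
    "INSERT INTO Subject_Hap (Sub_ID, Hap_Locus_ID, Copy_Num, Hap_ID) "
    "VALUES (?, ?, ?, ?)",
    [(1, 10, 1, 101), (1, 10, 2, 102)],
)
haps = db.execute(
    "SELECT Copy_Num, Hap_ID FROM Subject_Hap WHERE Sub_ID = 1 ORDER BY Copy_Num"
).fetchall()
print(haps)
```

Under this reading, the composite key lets one subject hold zero to many Subject_Hap rows, while the foreign key ties every row to exactly one Subject.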






Subject_Measurement
    Measure_Num     Yes  No   subject clinical measurement
    Measure_Result  No   No   result of the measurement
    Measure_ID      Yes  Yes  id
    Descr           No   No
    Operator        No   No   who did it
    QC              No   No   quality control data
    Measure_Date    No   No   when it's done
    Sub_ID          Yes  Yes  subject being measured

    A Subject_Measurement is associated with exactly one Subject.
    A Subject_Measurement is associated with exactly one Trial_Measurement.




Subject_Poly
    Poly_ID         Yes  Yes  subject genotyping info
    Copy_Num        Yes  No   identify the copy of the SNP
    Descr           No   No
    Allele_Code     No   Yes  the allele for the subject
    QC              No   No   quality control data

    A Subject_Poly is associated with exactly one Subject.
    A Subject_Poly is associated with exactly one Allele.
    A Subject_Poly is associated with exactly one Polymorphism.






Therap_Drug
    Descr           No   No
    Drug_ID         Yes  Yes  drug info for the therapeutical area
    Therap_ID       Yes  Yes

    A Therap_Drug is associated with exactly one Therapeutic_Area.
    A Therap_Drug is associated with exactly one Drug.
    A Therap_Drug is associated with exactly one Therapeutic_Area.



Therapeutic_Area
    Descr           No   No   the look up table for the therapeutic areas
    Related_Area    No   No
    Therap_Area     No   No
    Therap_ID       Yes  No

    A Therapeutic_Gene is associated with exactly one Therapeutic_Area.
    An Ind_Medical_History is associated with exactly one Therapeutic_Area.
    A Disease_Susceptibility is associated with exactly one Therapeutic_Area.
    A Clinical_Trial is associated with zero or one Therapeutic_Area.
    A Therapeutic_Area is associated with zero to many Therap_Drug.
    A Therapeutic_Area is associated with zero to many Therapeutic_Gene.
    A Therapeutic_Area is associated with zero to many Ind_Medical_History.
    A Therapeutic_Area is associated with zero to many Disease_Susceptibility.
    A Therapeutic_Area is associated with zero to many Clinical_Trial.



Therapeutic_Gene
    Descr           No   No   gene links to the therapeutic areas
    Therap_ID       Yes  Yes
    Gene_ID         Yes  Yes

    A Therapeutic_Gene is associated with exactly one Therapeutic_Area.
    A Therapeutic_Gene is associated with exactly one Gene.


Transcript_Region
    Descr           No   No
    Transcript_ID   No   Yes  link between gene region and the transcript
    Region_ID       Yes  Yes

    A Transcript_Region is associated with exactly one Gene_Region.
    A Transcript_Region is associated with exactly one Gene_Transcript.


Trial_Cohort
    Descr           No   No
    Cohort_ID       Yes  Yes  cohort involved in the clinical trial
    Trial_ID        Yes  Yes

    A Trial_Cohort is associated with exactly one Clinical_Trial.
    A Trial_Cohort is associated with exactly one Cohort.


Trial_Drug
    Descr           No   No
    Trial_ID        Yes  Yes  drug used in the clinical trial
    Drug_ID         Yes  Yes

    A Trial_Drug is associated with exactly one Drug.
    A Trial_Drug is associated with exactly one Clinical_Trial.


Trial_Measurement
    Measure_Name    No   No   Recording of the clinical measurement
    Measure_Details No   No   measurement result
    Descr           No   No
    Measure_Type    No   No   type
    Measure_Abbrev  No   No   abbreviation form of the measurement name
    Measure_ID      Yes  No   id
    Trial_ID        No   Yes  trial in which the measurement is taken

    A Trial_Measurement is associated with one to many Subject_Measurement.
    A Subject_Measurement is associated with exactly one Trial_Measurement.
    A Trial_Measurement is associated with exactly one Clinical_Trial.



Unordered_Contig
    Descr            No   No   a table to handle the unordered sequence pieces
    Uncontig_Seq_ID  No   Yes  the actual sequence corresponding
    Uncontig_List_ID No   Yes  the accession in which it's reported
    Uncontig_ID      Yes  Yes  id

    A Unordered_Contig is associated with exactly one Genetic_Feature.
    A Unordered_Contig is associated with zero or one Genetic_Feature.
    A Unordered_Contig is associated with zero or one Genetic_Feature.

URL
    URL             No   No   the URL address
    Most_Current    No   No   version management for the record
    URL_ID          Yes  No   id
    Descr           No   No

    A Genetic_Accession is associated with zero or one URL.
    A Med_Thesaurus is associated with zero or one URL.
    A URL is associated with zero or one URL.
    A Literature is associated with zero or one URL.
    A URL is associated with zero or one URL.
    A URL is associated with zero to many Genetic_Accession.
    A URL is associated with zero to many Med_Thesaurus.
    A URL is associated with zero to one URL.
    A URL is associated with zero or one Literature.




G. BUSINESS MODELS 

1. Hap2000 Partnership 

The haplotype and other data developed using the methods and/or tools
described herein may be used in a partnership of two or more companies
(referred to herein as the Partnership) to integrate knowledge of human
population and evolutionary variation into the discovery, development and
delivery of pharmaceuticals. The partners in the Partnership may be
classified as pharmaceutical, biopharmaceutical, biotechnology, genomics,
and/or combinatorial chemistry companies. One of the partners, referred to
herein as the HAP™ Company, will provide the other partner(s) with the tools
needed to address drug response problems that are attributable to human
diversity.

The HAP™ Company will focus on identifying polymorphisms in genes and/or
other loci found in a diverse set of individuals, information on which will
be stored in a database (referred to herein as the Isogenomics™ Database).
Preferably, the database is designed to store polymorphism information for
at least 2000 genes and/or other loci that are important to the
pharmaceutical process. In a preferred embodiment, the polymorphisms
identified are gene-specific haplotypes and the genes chosen for analysis
will be prioritized by the HAP™ Company by pharmaceutical relevance.
Analyzed genes may include, while not being limited to, known drug targets,
G-coupled protein receptors, converting enzymes, signal transduction proteins
and metabolic enzymes. The database will be accessible through an
informatics computer program for epidemiological correlation and evaluation,
a preferred embodiment of which is the DecoGen™ application described above.

a. Partnership Benefits

i. Isogenomics™ Database

The partners will have non-exclusive access to the Isogenomics™ Database,
which contains the frequencies, sequences and distribution of the
polymorphisms, e.g., gene haplotypes, found in a diverse set of individuals,
referred to herein as the index repository, which preferably represents all
the ethnogeographic groups in the world. Haplotypes in the database
preferably include polymorphisms found in the promoter, exons, exon/intron
boundaries and the 5' and 3' untranslated regions. Preferably, the number of
individuals examined in the index repository allows the detection of any
haplotype whose frequency is 10% or higher with a 99% certainty.
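
To make the 99% detection criterion concrete, the following sketch estimates how many chromosomes must be sampled so that a haplotype of a given population frequency is observed at least once. It assumes independent random sampling of chromosomes, two per diploid individual, which is a simplification not stated in the text; the function names are illustrative.

```python
import math

def chromosomes_needed(freq, certainty):
    """Smallest n with 1 - (1 - freq)**n >= certainty, assuming independent draws."""
    if not (0.0 < freq < 1.0 and 0.0 < certainty < 1.0):
        raise ValueError("freq and certainty must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - certainty) / math.log(1.0 - freq))

def individuals_needed(freq, certainty):
    """Diploid individuals required, counting two chromosomes per individual."""
    return math.ceil(chromosomes_needed(freq, certainty) / 2)

print(chromosomes_needed(0.10, 0.99), individuals_needed(0.10, 0.99))
```

Under these assumptions a haplotype at 10% frequency calls for 44 sampled chromosomes, or 22 individuals. A repository meant to represent many ethnogeographic groups would presumably need this many individuals per group rather than in total.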
ii. Informatics Computer Program

The information within the Isogenomics™ Database is part of the HAP™
Company's informatics computer program, which is accessible through an
intuitive and logical user interface. The informatics program contains
algorithms for the reconstruction of relationships among gene haplotypes and
is capable of abstracting biological and evolutionary information from the
Isogenomics™ Database. The informatics program is designed to analyze
whether genes in the Isogenomics™ Database are relevant to a clinical
phenotype, e.g., whether they correlate with an effective, inadequate or
toxic drug response. In a preferred embodiment, the program also contains
algorithms designed for detecting clinical outcomes that are dependent upon
cooperative interactions among gene products. In this embodiment, the
computer system has the capability to simulate gene interactions that are
likely to cause polygenic diseases and phenotypes such as drug response. The
informatics computer program will be installed at a site selected by each
partner. The information in the Isogenomics™ Database will be of immediate
use to drug discovery teams for target validation and lead prioritization
and optimization, to drug development specialists for design and
interpretation of clinical trials, and to marketing groups to address
problems encountered by an approved drug in the marketplace.

iii. Cohort Haplotyping

In one preferred embodiment, partner(s) can use the genotyping and/or
haplotyping capabilities of the HAP™ Company to stratify their clinical
cohorts, which will enable the partner(s) to separate cohorts by drug
response. For a fixed fee per patient, the HAP™ Company will genotype and/or
haplotype Phase II, Phase III, and Phase IV patient cohorts under good
laboratory practice (GLP) conditions that will allow submittal of the data
to clinical regulatory authorities. Preferably, the clinical genotype and/or
haplotype data is deposited within a component of the informatics computer
program that is proprietary to the partner to allow the partner to correlate
polymorphisms such as gene haplotypes with drug response.

iv. Isogene Clones

Partner(s) will have access to the physical clones that correspond to each
of the haplotypes for a given gene or other locus. These isogene clones can
be used in primary or secondary screening assays and will provide useful
information on such pharmacological properties as drug binding, promoter
strength, and functionality.

v. Gene Selection by Partners

The partners can select genes (or other loci) of their choosing for
haplotyping in the index repository. The genes selected can be in the public
domain or proprietary to the partner(s). In a preferred embodiment,
haplotyping results for a proprietary gene will only be accessible by the
owner of that gene until sequence information for the gene enters the public
domain.

vi. Patent Dossier

In a preferred embodiment, the Isogenomics™ Database also contains public
patent information that is available for each gene in the database. This
feature provides the partner(s) with an understanding of the potential
proprietary status of any gene in the database.

vii. Committed Liaison

In a preferred embodiment, the HAP™ Company will assign a Ph.D. level
scientist as a liaison to a partner to facilitate communication, technology
transfer, and informatics support.

viii. Special Services: cDNAs and Genomic Intervals

In a preferred embodiment, the HAP™ Company will also provide, at an extra
charge, special molecular, biological and genomics services to partner(s)
who submit cDNAs or ESTs to be haplotyped. cDNAs or ESTs will be utilized to
retrieve genomic loci and to create special haplotyping assays that will
allow the gene locus at the chromosome level to be haplotyped in the index
repository. Genomic intervals containing possible genes of high significance
for phenotypic correlations stemming from positional cloning programs can
also be submitted by partner(s) for haplotyping.

b. Membership in the Partnership

Each partner will pay the HAP™ Company a fee for membership in the
Partnership, preferably for a period of at least two or three years.
Companies joining the Partnership may utilize the resources of the
informatics computer program and Isogenomics™ Database on a company-wide
basis, including groups in drug discovery, medicinal chemistry, clinical
development, regulatory affairs, and marketing.



c. Envisioned Outcomes From The Partnership

It is contemplated that novel isogenes will be isolated and characterized by
the HAP™ Company, as well as methods for the detection of novel SNPs or
haplotypes encompassed by the isogenes.

It is also contemplated that associations between clinical outcome and
haplotypes (hereinafter "haplotype association") for many of the genes in
the Isogenomics™ Database will be discovered. Therefore, it is also
contemplated that methods of using the haplotypes and/or isogenes for
diagnostic or clinical purposes relating to disease indications supported by
the particular association will be discovered.

It is further contemplated there will be successful applications of the data
and informatics tools for drug approval and marketing.

A number of different scenarios for using the database and/or analytical
tools of the present invention may be envisioned. These include the
following:

1. A Partner selects a candidate gene or genes from the HAP™ Company's
database that is haplotyped. The Partner provides clinical cohorts for
haplotype analysis and provides clinical response data for the cohorts. The
HAP™ Company performs haplotype analysis for the candidate gene(s) in the
clinical cohorts, finds new haplotypes, if any, and determines the
association between one or more haplotypes and clinical response using the
informatics computer program.

2. The Partner selects a candidate gene from the HAP™ Company's database
that is haplotyped. The Partner provides clinical cohorts for haplotype
analysis. The HAP™ Company does haplotype analysis, finds new haplotypes, if
any, and sends the haplotype data to the Partner. The Partner determines the
association between haplotype and clinical response using the informatics
computer program provided by the HAP™ Company.

3. Like 1 above, but the Partner performs the haplotype analysis and
determines the association between haplotype and clinical response.

4. Like 2 above, but the Partner performs the haplotype analysis.

5. A Partner provides one or more genes to the HAP™ Company for haplotype
analysis. The HAP™ Company clones and characterizes isogenes for the
gene(s), discovers new polymorphisms in the gene, if any, and determines the
haplotypes for the gene(s).

6. Based on polymorphisms observed in a gene or genes, a Partner sends the
HAP™ Company clinical cohorts to haplotype and the Partner uses the
haplotype data in conjunction with their own clinical response data to
determine the association between haplotype and clinical response.

7. A Partner sends the HAP™ Company a cDNA or an expressed sequence tag
(EST). The HAP™ Company isolates and characterizes the gene corresponding to
the cDNA or EST. The HAP™ Company clones isogenes of the gene and determines
the haplotypes embodied within the isogenes.

A more detailed description of how the database and/or analytical tools of
the present invention may be used in the context of clinical trials is set
forth below.

As a review, the standard routine procedure in premarketing development of a
new drug to be used in humans is to conduct pre-clinical animal toxicology
studies in two or more species of animals followed by three phases of
clinical investigation as follows: Phase I: clinical pharmacology
investigations with attention to pharmacokinetics, metabolism, and both
single dose and dose-range safety; Phase II: limited size closely monitored
investigations designed to assess efficacy and relative safety; Phase III:
full scale clinical investigations designed to provide an assessment of
safety, efficacy, optimum dose and more precise definition of drug-related
adverse effects in a given disease or condition. In other words, Phase I and
Phase II are the early stages of the drug's development, when the safety and
the dosing level are tested in a small number of patients. Once the safety
and some evidence that the drug is effective in treatment have been
established, the drug's developer then proceeds to Phase III. In Phase III,
many more patients, usually several hundred, are given the new drug to see
whether the early findings that demonstrated safety and effectiveness will
be borne out in a larger number of patients. Phase III is pivotal to
learning hard statistical facts about a new drug. Larger numbers of patients
reveal the percentage of patients in which the drug is effective, as well as
give doctors a clearer understanding about the side effects which may occur.

In the research or discovery phase, a Partner's discovery personnel may
desire haplotype information for isogenes of a gene, and/or one or more
clones containing isogenes of the gene, regardless of whether or not
clinical trials (or field trials, in the case of plants) are planned, in
progress, or completed. For example, the Partner may be studying a gene (or
its encoded protein) and may be interested in obtaining information
concerning, e.g., protein structure or mRNA structure, in particular
information concerning the location of polymorphisms in the mRNA structure
and their possible effect on mRNA transcription, translation or processing,
as well as their possible effect on the structure and function of the
encoded protein. Such information may be useful in designing and/or
interpreting laboratory tests, such as in vitro or animal tests. Such
information may be useful in correlating polymorphisms with a particular
result or phenotype which may indicate that the gene is likely to be
responsible for certain diseases, drug response or other trait. Such
information could aid in drug design for pharmaceutical use in humans and
animals, or aid in selecting or augmenting plants or animals for desired
traits such as increased disease or pest resistance, or increased fertility,
for agricultural or veterinary use. The Partner may also be interested in
knowing the frequency of the haplotypes. Such information may be used by the
Partner to determine which haplotypes are present in the population below a
certain frequency, e.g., less than 5%, and the Partner may use this
information to exclude studying the isogenes, mRNAs and encoded proteins for
these haplotypes and may also use this information to weed out individuals
containing these haplotypes from their proposed clinical trials.

When information such as that described above is desired by a Partner, then
the HAP™ Company may give access to the Partner to all or part of the data
and/or analytical tools exemplified herein by the DecoGen™ Informatics
Platform. The Partner may also be given access to one or more clones
containing isogenes, e.g., a genome anthology clone (see, e.g., US Patent
Application Ser. No. 60/032,645, filed December 10, 1996 and US Patent
Application Ser. No. 08/987,966, filed December 10, 1997).

During a Phase I clinical trial, which is being conducted to determine the
safety of a drug (or drugs) in people, a Partner may desire haplotype
information for haplotypes of a gene, and/or one or more clones containing
isogenes of the gene, in particular when toxicity or adverse reactions to
the drug are observed in at least some of the people taking the drug. In
that case, the Partner may request that the HAP™ Company obtain, for each
person experiencing toxicity or other adverse effect, the haplotypes for one
or more genes which are suspected to be associated with the observed
toxicity or adverse effect (e.g., a gene or genes associated with liver
failure) and determine whether there is a correlation between haplotype and
the observed toxicity or adverse effect. If there is a correlation, then the
Partner may decide to keep all people having the haplotype correlated with
toxicity or other adverse effect out of Phase II clinical trials, or to
allow such people to enter Phase II clinical trials, but be monitored more
closely and/or given conjunctive therapy to modify the toxicity or other
adverse effect. The HAP™ Company may provide a diagnostic test, or have such
a test prepared, which will detect the people who have, or lack, the
haplotype correlated with toxicity or other adverse effect.
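
One simple way to check for a correlation of this kind is a 2x2 comparison of haplotype carrier status against occurrence of the adverse effect. The sketch below uses a one-sided Fisher's exact test, a standard technique swapped in here for illustration and not necessarily the method used by the DecoGen™ application; the cohort counts are invented.

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """P(at least `a` carriers among affected) under the hypergeometric null.

    Table layout (all counts are people):
                      affected   unaffected
        carrier          a           b
        non-carrier      c           d
    """
    carriers = a + b
    affected = a + c
    total = a + b + c + d
    denom = comb(total, affected)
    p = 0.0
    # Sum the probabilities of tables at least as extreme as the observed one.
    for k in range(a, min(carriers, affected) + 1):
        p += comb(carriers, k) * comb(total - carriers, affected - k) / denom
    return p

# Hypothetical Phase I cohort: 5 of 6 people with toxicity carry the
# haplotype, versus 4 of 34 people without toxicity.
p_value = fisher_exact_one_sided(5, 4, 1, 30)
print(f"one-sided p = {p_value:.2e}")
```

A small p-value would only flag the haplotype for follow-up. With several candidate genes and haplotypes tested at once, a correction for multiple comparisons would also be needed.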

During a Phase II clinical trial, which is being conducted to determine the
efficacy of a drug (or drugs) in people, a Partner may desire haplotype
information for haplotypes of a gene, and/or one or more clones containing
isogenes of the gene, in particular when the results of the trial are
ambiguous. For example, the results of a Phase II clinical trial might
indicate that 50% of the people given a drug were responders (e.g., they
lost weight in a trial for an anti-obesity drug, albeit to different
degrees), 49.9% of people were non-responders (e.g., they did not lose any
weight) and 0.1% had adverse effects. In such a case, the Partner may, for
example, request that the HAP™ Company obtain, for each person in the Phase
II clinical trial, the haplotypes for one or more genes which are suspected
to be associated with the drug response. (In general, such gene(s) will be
different from the gene associated with the adverse effect, but not
necessarily.) A correlation may then be obtained between various haplotypes
and the observed level of response to the drug. If a correlation is found,
this information may be used to determine those individuals in which the
drug will or will not be effective and, therefore, identify who should or
should not get the drug. In addition, the information may also be used to
develop a model (or test) which will predict, as a function of haplotype,
how much of the drug should be used in an individual patient to get the
desired result. Again, the HAP™ Company may provide a diagnostic test, or
have such a test prepared, which will detect the people who have, or lack,
the haplotype correlated with the efficacy or non-efficacy of the drug.

During Phase III clinical trials, which are being conducted to verify the
safety and efficacy of a drug (or drugs) in people, a Partner may desire
haplotype information for isogenes of a gene, and/or one or more clones
containing isogenes of the gene, in particular to use at the beginning of
the trial to design cohorts of patients (i.e., a group of individuals which
will be treated the same). For example, the drug or placebo can be given to
a group of people who have the same haplotype which is expected to be
correlated with a good drug response, and the drug or placebo can be given
to a group of people who have the same haplotype which is expected to be
correlated with no drug response. The results of the trial will confirm
whether or not the expected correlation between haplotype and drug response
is correct.

During "Phase IV," which involves monitoring of clinical results after FDA
approval of a drug to obtain additional data concerning the safety and
efficacy of a drug (or drugs) in people, a Partner may desire haplotype
information for a gene, and/or one or more clones containing isogenes of the
gene, in particular if additional adverse events (or hidden side effects)
become apparent. In such a case, the methods described above can be used to
identify people who are likely to experience such adverse events.

After clinical trials are successfully completed, a Partner may desire
haplotype information for isogenes of a gene, and/or one or more isogene
clones, in particular in the situation where the drug is what is known as a
"me too" drug, i.e., there are already a number of drugs on the market used
to treat the disease or other condition which the Partner's drug is designed
to treat. This can be used, e.g., as a marketing or business development
tool for the Partner and/or help health care providers, such as doctors and
HMOs, to keep drug costs down. For example, the haplotype information and
analytical tools of the invention may be used to identify the patients for
which the Partner's drug will work and/or for whom the Partner's drug will
be superior to (or cheaper than) the other drugs on the market. A test can
be developed to identify the target patients. This test can be diagnostic
for the condition (e.g., it could distinguish asthma from a respiratory
infection) or it could be diagnostic for response to the drug. Preferably
the doctor can perform the test in his office or other clinical setting and
be able to prescribe the appropriate drug immediately, or after access to
part or all of the database or analytical tools of the invention. This will
also aid the doctor in that it may provide information about which drugs not
to give, since they will not be effective in the patient. Again, this
reduces costs for the patient and/or health care provider, and will likely
accelerate the time in which the patient will receive effective treatment,
since time may be saved by eliminating trial and error administrations of
other drugs which would not be expected to work for the disease or condition
manifested by the patient.

If clinical trials are unsuccessfully completed, a Partner may desire
haplotype information for isogenes, and/or one or more isogene clones
containing isogenes of the gene, to correlate drug response with haplotype
and to use as an aid in designing an additional clinical trial (or trials),
as discussed elsewhere herein.

The database and analytical tools of the invention are envisioned to be
useful in a variety of settings, including various research settings,
pharmaceutical companies, hospitals, independent or commercial
establishments. It is expected users will include physicians (e.g., for
diagnosing a particular disease or prescribing a particular drug),
pharmaceutical companies, generics companies, diagnostics companies,
contract research organizations and managed care groups, including HMOs, and
even patients themselves.

However, as discussed above, it is obvious that various aspects of the
invention may be useful in other settings, such as in the agricultural and
veterinary venues.

The following examples illustrate certain embodiments of the present
invention, but should not be construed as limiting its scope in any way.
Certain modifications and variations will be apparent to those skilled in
the art from the teachings of the foregoing disclosure and the following
examples, and these are intended to be encompassed by the spirit and scope
of the invention.

2. Mednostics Program

The Mednostics™ program is a program in which one company, i.e., the HAP™
Company, uses HAP Technology to analyze variation in response to drugs
currently marketed by third parties, in the hope of conferring a competitive
advantage on these companies. It is expected that this technology will
provide pharmaceutical companies with information that could lead to the
development of new indications for existing drugs, as well as second
generation drugs designed to replace existing drugs nearing the end of their
patent life. As a result, the Mednostics program will benefit pharmaceutical
companies by allowing them to extend the patent life of existing drugs,
revitalize drugs facing competition and expand their existing market.
Entities such as HMOs and other third-party payers, as well as pharmacy
benefit management organizations, may also benefit from the Mednostics
program.

The goals of the Mednostics™ program are to find HAP Markers that:

• identify individuals who are currently not undergoing therapy for a given
  disease yet are at risk and will respond well to a given drug. This
  application would be useful in markets that have high growth potential and
  involve conditions that are undertreated, such as many central nervous
  system disorders and cardiovascular disease; and

• identify individuals who will respond better to one drug within a
  competitive class than other drugs in the same class or to one competing
  class of drugs as compared to another class of drugs. This application
  would allow drugs that are not selling well to gain a greater market share
  and would be best applied to a drug that was not the first introduced into
  the market and is having difficulty gaining market share against the
  established competitors. Alternatively, if multiple drug classes are
  indicated for the same disease, they could be differentiated by HAP
  Markers, thus giving drugs within one class a competitive advantage over
  the other class.

An example of the Mednostics™ program involves the statin class of drugs,
which are used to treat patients with high cholesterol and lipid levels and
who are therefore at risk for cardiovascular disease. This is a highly
competitive market with multiple approved products seeking to gain increased
market share. For example, three of the most commonly prescribed statins are
pravastatin (sold by Bristol-Myers Squibb Company as Pravachol),
atorvastatin (sold by Parke-Davis as Lipitor), and cerivastatin (sold by
Bayer AG as Baycol). The statin market is currently approximately $11
billion worldwide and is forecasted to at least double in size by 2005.
Identification of genetic markers that would allow the right drug to reach
the right patient would allow a company to boost its market share and
improve patient compliance, which are both particularly important factors
when maximizing profit from drugs that are taken over the course of a
lifetime.

H. EXAMPLE 1

SIMULATED CLINICAL TRIAL

For illustration, we will use a particular example that shows
how the CTS™ method works, and how the DecoGen™ application is used. For
this we have simulated a data set. Polymorphisms for the gene CYP2D6 were
obtained from the literature. From those we constructed 10 haplotypes. A set of
individual subjects was created, and each subject was assigned a value of the
variable "Test" in the range from 0.0-1.0. Each subject was also assigned 2 of the
haplotypes. This data set simulates what would come from a clinical trial in which
patients were haplotyped and tested for some clinical variable. Most individuals
have a relatively low value of the Test measure, but a small number have a large
value. This simulates the case where a small number of individuals taking a
medication have an adverse reaction. Our goal is to find genetic markers (i.e.
haplotypes) that are correlated with this adverse event.
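A data set of this kind can be sketched in a few lines of Python. The sketch below is illustrative only: the function name, the numeric ranges chosen for "low" and "high" values of Test, and the rule that ties a high value to homozygosity for one hypothetical risk haplotype are all assumptions, since the text does not state how the simulated values were generated.

```python
import random

def simulate_trial(n_subjects=200, n_haplotypes=10, risk_haplotype=0, seed=1):
    """Return a list of (haplotype_pair, test_value) records.

    Haplotypes are represented by integer labels 0..n_haplotypes-1
    (hypothetical labels, not the CYP2D6 haplotypes of the figures).
    """
    rng = random.Random(seed)
    subjects = []
    for _ in range(n_subjects):
        # Each subject carries two haplotypes, drawn with replacement.
        pair = tuple(sorted(rng.randrange(n_haplotypes) for _ in range(2)))
        if pair == (risk_haplotype, risk_haplotype):
            # Assumed rule: homozygotes for the risk haplotype respond badly.
            test = rng.uniform(0.8, 1.0)
        else:
            # Everyone else receives a relatively low value of Test.
            test = rng.uniform(0.0, 0.3)
        subjects.append((pair, test))
    return subjects
```

With ten equally likely haplotypes, roughly one subject in a hundred is homozygous for the risk haplotype, which reproduces the "small number with a large value" pattern described above.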

Step 1. Identify candidate genes. CYP2D6 is the sample
candidate gene.

Step 2. Define a Reference Population. A standard
population is used. An example is the CEPH families and unrelated individuals
whose cell lines are commercially available. (Source: Coriell Cell Repositories,
URL: http://locus.umdnj.edu/nigms/ceph/ceph.html) Coriell sells cell lines from the
CEPH families (a standard set of families from the United States and France for
which cell lines are available for multiple members from several generations from
several families) and from individuals from other ethnogeographic groups. The
CEPH families have been widely studied. The cell lines were originally collected by
Foundation Jean DAUSSET (http://landm.cephb.fr/).

Step 3. DNA from this reference population is obtained.

Step 4. Haplotype individuals in the reference population.
We use either direct or indirect haplotyping methods, or a combination of both, to
obtain haplotypes for the CYP2D6 gene in the reference population. The
polymorphic sites and nucleotide positions for these individuals are given in
FIGURES 4A and 4B.

Step 5. Get population averages and other statistics. The
haplotypes and population distributions are shown using the DecoGen™ application
in FIGURES 4A, 4B, 10, and 11. They are determined by the methods and
equations described in Item 5 above.

Step 6. Determine genotyping markers. By examining the
linkage data (FIGURE 15) we see that all of the sites are tightly linked except 2 and
8. This indicates that this set should be a minimal set for genotyping. From this it
was decided to genotype patients in the clinical trial at only these sites.
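The text does not say which linkage statistic underlies FIGURE 15. As one common possibility, the sketch below computes the pairwise r² measure of linkage disequilibrium between two biallelic sites from a list of phased haplotype strings. The function name and the choice of r² are assumptions made for illustration.

```python
def r_squared(haplotypes, site_a, site_b):
    """Pairwise r^2 between two biallelic sites (0-based indices).

    haplotypes: list of equal-length strings, one character per site.
    The allele seen in the first haplotype is used as the reference
    allele at each site.
    """
    n = len(haplotypes)
    a_ref = haplotypes[0][site_a]
    b_ref = haplotypes[0][site_b]
    p_a = sum(h[site_a] == a_ref for h in haplotypes) / n
    p_b = sum(h[site_b] == b_ref for h in haplotypes) / n
    p_ab = sum(h[site_a] == a_ref and h[site_b] == b_ref for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic site carries no linkage information
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / denom
```

Sites whose r² with every other site is low are not predicted by their neighbours, so they are the ones that would have to be genotyped directly.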

Step 7. Recruit a trial population. In this case we use the
reference population as the clinical population, having only added the simulated
values of Test.

Step 8. Treat, test and haplotype patients. All patients are
measured for the Test variable. All of the patients were then genotyped at sites 2 and
8 (i.e. unphased haplotypes were found at these sites). Next their haplotypes are
found directly (for those individuals who were totally homozygous or heterozygous
at only one site) or inferred using maximum likelihood methods based on the
observed haplotype frequencies in the reference population.
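The inference step can be sketched as follows. The sketch enumerates every pair of known haplotypes that is consistent with an unphased genotype and weights each pair by its expected frequency under Hardy-Weinberg equilibrium. The function name, the data layout, and the Hardy-Weinberg weighting are assumptions made for illustration and are not taken from the DecoGen™ application.

```python
from itertools import combinations_with_replacement

def infer_haplotype_pair(genotype, hap_freqs):
    """Pick the most likely haplotype pair for an unphased genotype.

    genotype:  list of 2-letter strings, one per site, e.g. ["CT", "AG"].
    hap_freqs: dict mapping haplotype string -> reference frequency.
    Returns (best_pair, posterior), where posterior maps each consistent
    pair to its normalized probability.
    """
    target = [tuple(sorted(g)) for g in genotype]
    scores = {}
    for h1, h2 in combinations_with_replacement(sorted(hap_freqs), 2):
        observed = [tuple(sorted(alleles)) for alleles in zip(h1, h2)]
        if observed != target:
            continue  # this pair cannot produce the observed genotype
        weight = hap_freqs[h1] * hap_freqs[h2]
        if h1 != h2:
            weight *= 2  # a heterozygous pair can arise in two orders
        scores[(h1, h2)] = weight
    total = sum(scores.values())
    if total == 0:
        return None, {}
    posterior = {pair: w / total for pair, w in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior
```

An individual who is homozygous at every site, or heterozygous at only one, has a single consistent pair and is resolved with probability 1, which corresponds to the "found directly" case above.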

Step 9. Find correlations between haplotype pair and clinical
outcome. We measure the value of Test.

First we examine the results of the single site regression
model (FIGURE 21) to determine the sites showing the strongest correlation with
Test. From this we see that sites 2 and 8 have a strong correlation, at the 99%
confidence level.

The statistics for each of the sub-haplotype pair groups (using
sites 2 and 8) are shown in FIGURES 18, 19, and 22. From this we see that
individuals homozygous for TA at sites 2 and 8 have a high value of Test (average
of 0.93). One conclusion we can make from this data is that patients homozygous
for TA are likely to have an adverse reaction. A typical haplotype pair distribution is
shown in detail in FIGURE 20.

We can use the ANOVA calculation to see whether grouping
individuals by haplotype-pair (or sub-haplotype-pair) helps explain the observed
variation in response in a statistically significant way. If ANOVA indicates that
there is a significant group-to-group variation, then we can investigate this
correlation further using the regression and clinical modeling tools. From FIGURE
23, we see that there is a significant level of group-to-group variation even at the
99% confidence level. This says that the haplotype-pair (or sub-haplotype-pair) that
an individual has for this gene does have a significant impact on that individual's
value of Test.
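For readers who want to see the calculation, the following is a minimal sketch of a one-way ANOVA F statistic computed over groups defined by haplotype pair. The function name and input layout are hypothetical. The sketch stops at the F statistic and its degrees of freedom; converting F to a confidence level requires the F distribution, which is not implemented here.

```python
def one_way_anova_f(groups):
    """One-way ANOVA over groups of clinical values.

    groups: dict mapping a group label (e.g. a haplotype pair) to a list
    of values. Returns (F, df_between, df_within). Assumes at least two
    groups and a nonzero within-group variance.
    """
    values = [v for g in groups.values() for v in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F relative to the critical value for (df_between, df_within) indicates that grouping by haplotype pair explains a significant share of the variation in Test.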

Step 10. Follow-up trials are run. Additional trials should be
run to accomplish 2 goals. The first would attempt to prove the correlation between
being homozygous for haplotype TA and the high value of Test. One way to do this
would be to enroll a group of subjects and break them into 4 cohorts. The first and
second would be homozygous for TC. The third and fourth would have no copies
of TC. The first and third group should take the medication causing the high value
of Test and the second and fourth should take a placebo. The cohorts and their
expected response are shown in the following matrix:

Cohort      Haplotype pair    Treatment     Expectation
Cohort 1    TC/TC             Medication    High value of Test
Cohort 2    TC/TC             Placebo       Low value of Test
Cohort 3    Not-TC/not-TC     Medication    Low value of Test
Cohort 4    Not-TC/not-TC     Placebo       Low value of Test

If we see this pattern of response, then the correlation between TC
homozygosity and a high value of Test is proven.

Step 11. Design a genotyping method to identify a relevant
set of patients. Using the Genotype view tool in the DecoGen browser, we found
that by genotyping individuals at sites 2 and 8 we could classify the group with high
value of Test with 100% certainty. The results are shown in FIGURE 14.

I. EXAMPLE 2 

1. Provision Of Clinical Data 

DNA sequence information for a cohort of normal subjects
was obtained and entered into the database as described previously. For this
example, 134 patients, all of whom came to the clinic having an asthmatic attack,
were recruited. Each patient had a standard spirometry workup upon entering the
clinic, was given a standard dose of albuterol, and was given a followup spirometry
workup 30 minutes later. Blood was drawn from each patient, and DNA was
extracted from the blood sample for use in genotyping and haplotyping. Clinical
data, in the form of the response of the asthmatic patients to a single dose of
nebulized albuterol, was obtained from the asthmatic patients, as described
previously (Yan, L., Galinsky, R.E., Bernstein, J.A., Liggett, S.B. & Weinshilboum,
R.M. Pharmacogenetics, 2000, 10:261-266). The clinical data was entered into the
database, and displayed as in Fig. 29B.

2. Determination Of ADBR2 Genotypes And Haplotypes

Haplotypes for ADBR2 were determined using a molecular
genotyping protocol, followed by the computational HAPBuilder procedure (See
U.S. patent application serial No. 60/198,340 (inventors: Stephens, et al.), filed
April 18, 2000). Comparison of the sequences resulted in the identification of
thirteen polymorphic sites.

The ADBR2 gene was selected from the screen shown in Fig. 
26. The polymorphism and haplotype data for the ADBR2 gene among normal 
subjects was as displayed in Fig. 28. Only twelve different haplotypes were 
observed and/or inferred. Diplotype and haplotype data for the ADBR2 gene 
among the asthmatic patients was as displayed in Fig. 29A.

The heterozygosity of individual patients at each polymorphic
site was as displayed in Fig. 30. At each polymorphic site (SNP), each patient has
zero, one, or two copies of a given nucleotide. The same is true of combinations of
SNPs: for any collection of two or more SNPs (i.e., a haplotype or sub-haplotype), a
patient will have zero, one, or two alleles having that particular combination of
SNPs.

3. Correlation Of ADBR2 Haplotypes
And Haplotype Pairs With Drug Response

The measure of delta %FEV1 pred. was chosen as the clinical
outcome value for which correlations with ADBR2 haplotypes were to be sought.

a. Build-Up Procedure (To 4 SNP Limit)

Each individual SNP was statistically analyzed for the degree
to which it correlated with "delta %FEV1 pred." The analysis was a regression
analysis, correlating the number of occurrences of the SNP in each subject's
genome (i.e., 0, 1, or 2), with the value of "delta %FEV1 pred."
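The regression on copy number can be illustrated with an ordinary least-squares fit. This is a minimal sketch under the assumption that a simple linear model is used; the function name is hypothetical, and the p-value reported by the application is not reproduced because it requires the t distribution.

```python
def regress_on_copy_number(copies, response):
    """Least-squares fit of response = intercept + slope * copies.

    copies:   list of 0, 1 or 2 (copies of the SNP or sub-haplotype).
    response: list of clinical values, e.g. delta %FEV1 pred.
    Returns (slope, intercept, r_squared). Assumes both inputs vary.
    """
    n = len(copies)
    mean_x = sum(copies) / n
    mean_y = sum(response) / n
    sxx = sum((x - mean_x) ** 2 for x in copies)
    syy = sum((y - mean_y) ** 2 for y in response)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(copies, response))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

A negative slope means that each additional copy of the marker is associated with a lower response.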

"Cut-off" criteria were applied to each SNP in turn, as
follows. In this example, a confidence limit of 0.05 was the default value for the
tight cutoff, and a limit of 0.1 was the default value of the loose cutoff. The default
values were automatically entered into the screen shown in Fig. 39A, in the two
boxes labeled "Confidence". A SNP was then chosen from among the SNPs present
in the population, and the p value calculated for correlation of this SNP with delta
%FEV1 pred. was tested against the tight cutoff. If the value was 0.05 or less, the
SNP and associated correlation data were stored for later calculations and for
display in the screen shown in Fig. 39A. If the p value was between 0.05 and 0.1, the
SNP and associated correlation data were stored without being displayed. Any SNP
whose p value was greater than 0.1 was discarded, i.e., it was not considered further
in the process. All thirteen ADBR2 SNPs were selected and tested in turn. The
individual SNPs at positions 3 and 9 passed the tight cut-off; these were saved for
display in Fig. 39A. In addition, the SNP at position 11 passed the loose cut-off and
was saved without display.
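The two-level cut-off rule amounts to a three-way classification of each p value. A minimal sketch follows; the function name and the returned labels are hypothetical, while the default thresholds are the ones stated above.

```python
def classify_by_cutoff(p_value, tight=0.05, loose=0.1):
    """Apply the tight and loose cut-offs to a single p value.

    Returns "display" (stored and displayed), "store" (stored for later
    combinations but not displayed) or "discard".
    """
    if p_value <= tight:
        return "display"
    if p_value <= loose:
        return "store"
    return "discard"
```

Keeping the "store" tier allows a SNP that is only weakly significant on its own to take part in later combinations.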

All possible pair-wise combinations (sub-haplotypes) of the
saved SNPs were then generated. The correlations of the newly generated two-SNP
sub-haplotypes with delta %FEV1 pred. were calculated by regression analysis, as
was done for the individual SNPs. The correlation of each sub-haplotype was tested
in turn, as described above, discarding any sub-haplotypes whose p-value did not
pass the cut-off criteria and saving those that did pass, with those that passed the
tight cut-off stored for display in the screen shown in Fig. 39A. The sub-haplotypes
that passed the tight cut-off were ********A*G**, **A*****A****, and
**A*******G**; these were saved for display in Fig. 39A. No sub-haplotypes
passed only the loose cut-off.

When all the two-SNP sub-haplotypes had been examined, all
pair-wise combinations between originally saved SNPs and saved two-SNP
sub-haplotypes, and among the saved two-SNP sub-haplotypes, were generated. This
produced a collection of three-SNP and four-SNP sub-haplotypes. Again,
correlations were calculated by regression. A single three-SNP sub-haplotype,
**A*****A*G**, passed the tight cut-off and was saved for display, and no
four-SNP sub-haplotype passed. No sub-haplotypes passed only the loose cut-off.
Combinations between the saved three-SNP sub-haplotypes and the saved SNPs
generated four-SNP sub-haplotypes, none of which passed the tight cut-off. No new
combinations were possible within the default limit (four) to the number of SNPs
permitted in the generated sub-haplotypes. (See Fig. 39A, where "fixed site = 4"
indicates the 4-SNP limit).
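The combination step of the build-up procedure can be sketched with the same masked-string notation used above, where "*" marks a site that is not part of the sub-haplotype. The function names are hypothetical, and the sketch covers only the generation of candidates; each candidate would still have to be tested by regression against the cut-offs.

```python
def merge(a, b):
    """Combine two sub-haplotype patterns; None if they disagree at a site."""
    merged = []
    for x, y in zip(a, b):
        if x == "*":
            merged.append(y)
        elif y == "*" or x == y:
            merged.append(x)
        else:
            return None  # conflicting alleles at the same site
    return "".join(merged)

def snp_count(pattern):
    """Number of unmasked sites in a pattern."""
    return sum(c != "*" for c in pattern)

def next_generation(saved, max_snps=4):
    """All new patterns obtained by pair-wise merging of saved patterns."""
    candidates = set()
    items = sorted(saved)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            m = merge(a, b)
            if m is not None and m not in saved and snp_count(m) <= max_snps:
                candidates.add(m)
    return sorted(candidates)
```

Starting from the three saved single-SNP patterns at positions 3, 9 and 11, one call yields the three two-SNP patterns, and a second call on the enlarged set yields the single three-SNP pattern **A*****A*G**.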
The results of the build-up process are shown in Fig. 39A,
where the SNPs and sub-haplotypes that passed the tight cut-off are displayed along
with the results of the regression analyses. It was discovered that the three-SNP
sub-haplotype **A*****A*G** had a p-value nearly identical to that of the full
haplotype. Figure 21b shows the regression line (response as a function of number
of copies of haplotype **A*****A*G**), indicating that the more copies of this
marker a patient has, the lower the response.



b. Pare-Down Procedure (To 10 SNP Limit) 

Each of the twelve haplotypes observed for the ADBR2 gene
is analyzed for the degree to which it correlates with the value of delta %FEV1 pred.
by a regression analysis, correlating the number of occurrences of the haplotype in
the subject's genome, i.e., 0, 1, or 2, with the value of the clinical measurement.

A "tight cut-off" criterion is then applied to each haplotype in
turn. A first haplotype is selected, and its correlation with delta %FEV1 pred. is
tested against the tight cut-off of 0.05. If the value is 0.05 or less, the haplotype and
associated correlation data are stored for later calculations and for display in the
screen shown in Fig. 39A. If the p value is between 0.05 and 0.1, the haplotype and
associated correlation data are stored as well but are not displayed. Any haplotype
whose p value is greater than 0.1 is discarded, i.e., it is not considered further in the
process. All twelve ADBR2 haplotypes are selected and tested in turn.

From the saved haplotypes, all possible sub-haplotypes in
which a single SNP is masked are generated by systematically masking each SNP of
all saved haplotypes. The correlations of the newly generated sub-haplotypes with
the clinical outcome value are calculated by regression, as was done for the
haplotypes themselves. Each newly generated sub-haplotype is tested against the
tight and loose cut-offs as described above for the haplotype correlations, discarding
sub-haplotypes that do not pass the cut-off criteria and saving those that do pass.

When the first generation of sub-haplotypes, having a single
SNP masked, has been tested, a second generation of sub-haplotypes having two
SNPs masked is generated from those of the first generation whose p-values passed
the cut-offs. This is done, as before, by systematically masking each of the
remaining SNPs. The p-values of the second generation of sub-haplotypes, having
two SNPs masked, are tested, and from those that pass the cut-offs a third
generation having three SNPs masked is generated.
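The generation of successive masked sub-haplotypes can be sketched as follows. The function names are hypothetical, and the sketch only produces the candidates for the next generation; filtering them by regression p-value is left out.

```python
def mask_one_more(pattern):
    """All sub-haplotypes obtained by masking one further unmasked site."""
    return [
        pattern[:i] + "*" + pattern[i + 1:]
        for i, c in enumerate(pattern)
        if c != "*"
    ]

def next_masked_generation(survivors):
    """Union of one-more-site maskings over all surviving patterns.

    survivors: patterns from the previous generation that passed the
    cut-offs. Duplicates reached by different routes are merged.
    """
    generation = set()
    for pattern in survivors:
        generation.update(mask_one_more(pattern))
    return sorted(generation)
```

Because the same sub-haplotype can be reached by masking sites in a different order, collecting candidates in a set avoids testing it twice.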

c. Cost Reduction

The frequencies for each of the twelve haplotypes of the
ADBR2 gene were calculated and were found to be as shown in Fig. 28A (eleven of
the twelve haplotypes are visible). A list of all 78 genotypes that could be derived
from the 12 observed haplotypes was generated. A portion of the list is shown in
Fig. 32. The expected frequency of each of these genotypes from the Hardy-Weinberg
equilibrium was calculated, and is shown in the third column under each
population group. Linkage between the polymorphic sites was as shown in Fig. 33.
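The count of 78 follows from taking unordered pairs with repetition from 12 haplotypes: 12 homozygous pairs plus 66 heterozygous pairs. The sketch below enumerates the pairs and assigns each its Hardy-Weinberg expectation, p² for a homozygous pair and 2pq for a heterozygous pair. The function name and data layout are hypothetical.

```python
from itertools import combinations_with_replacement

def hardy_weinberg_genotypes(hap_freqs):
    """Enumerate haplotype pairs with their Hardy-Weinberg frequencies.

    hap_freqs: dict mapping haplotype -> population frequency.
    Returns a list of ((h1, h2), expected_frequency), most frequent first.
    """
    genotypes = []
    for h1, h2 in combinations_with_replacement(sorted(hap_freqs), 2):
        freq = hap_freqs[h1] * hap_freqs[h2]
        if h1 != h2:
            freq *= 2  # heterozygous pairs arise in two parental orders
        genotypes.append(((h1, h2), freq))
    genotypes.sort(key=lambda item: item[1], reverse=True)
    return genotypes
```

When the haplotype frequencies sum to 1, the expected genotype frequencies also sum to 1, which is a convenient check on the list.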

A set of masks of the same length as the haplotype, i.e.,
thirteen sites in length, was created. A portion of the set of masks is shown in Fig.
34, along with a portion of the list of possible genotypes (haplotype pairs) which has
been sorted by Hardy-Weinberg frequency.

For each mask, an ambiguity score was calculated as follows:
all pairs of genotypes [i, j] that were rendered identical by imposition of the mask
were noted, and the geometric mean of their Hardy-Weinberg frequencies ($f_i$ and $f_j$)
was calculated. For each mask, all the geometric means of the frequencies of all the
ambiguous pairs were added together, and the sum was multiplied by 10 to obtain
the ambiguity score for that mask:

$$\text{ambiguity score} = 10 \sum_{[i,j]} \sqrt{f_i f_j}$$

Ambiguity scores calculated in this manner are shown in Fig.
34 to the right of each of the displayed masks, along with the genotype pairs
rendered ambiguous by the mask. (The genotype numbers refer to the row numbers
in the first column of the sorted genotype list.)
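The score can be sketched in code as follows. The function names are hypothetical, and one interpretive assumption is made: two genotypes are treated as "rendered identical" when their two haplotypes, after the masked sites are replaced by "*", form the same unordered pair.

```python
from itertools import combinations
from math import sqrt

def apply_mask(haplotype, masked_sites):
    """Replace the masked sites (1-based positions) with '*'."""
    return "".join(
        "*" if i + 1 in masked_sites else c for i, c in enumerate(haplotype)
    )

def masked_pair(pair, masked_sites):
    """Unordered pair of masked haplotypes for one genotype."""
    return tuple(sorted(apply_mask(h, masked_sites) for h in pair))

def ambiguity_score(genotype_freqs, masked_sites):
    """10 times the sum of geometric means over ambiguous genotype pairs.

    genotype_freqs: dict mapping (h1, h2) -> Hardy-Weinberg frequency.
    masked_sites:   set of 1-based site positions that are not measured.
    """
    total = 0.0
    for (g1, f1), (g2, f2) in combinations(sorted(genotype_freqs.items()), 2):
        if masked_pair(g1, masked_sites) == masked_pair(g2, masked_sites):
            total += sqrt(f1 * f2)  # geometric mean of the two frequencies
    return 10 * total
```

A mask that only confuses rare genotypes with one another receives a low score, because the geometric mean of two small frequencies is small.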

From the data visible in Fig. 34, it may be seen that one can
mask sites 1, 6, 7, 8, and 10 (five of the thirteen polymorphic sites in the ADBR2
gene) with an ambiguity score of only 0.072. This mask (sixteenth mask from the
top) renders four genotypes (sets of haplotype pairs) ambiguous, and three of the
four ambiguities are between common and rare haplotype pairs. It is thus
discovered that a savings of about 38% in the variable cost of haplotyping this gene
can be achieved, simply by measuring eight rather than all thirteen known
polymorphic sites, and that the complex haplotype can be inferred with high
confidence from this smaller data set.

Applicants reserve the right to challenge the accuracy and pertinency of the cited
references.

Modifications of the above described modes for carrying out 
the invention that are obvious to those of skill in the fields of chemistry, medicine, 
computer science and related fields are intended to be within the scope of the 
following claims. 




TABLE OF CONTENTS

I. TITLE OF THE INVENTION 1
II. RELATED APPLICATIONS 1
III. FIELD OF THE INVENTION 1
IV. BACKGROUND OF THE INVENTION 1
V. SUMMARY OF THE INVENTION 6
VI. BRIEF DESCRIPTION OF THE DRAWINGS 10
VII. DETAILED DESCRIPTION OF THE INVENTION 22
    A. DEFINITIONS 22
    B. METHODS OF IMPLEMENTING THE INVENTION 25
    C. CTS™ METHODS OF THE INVENTION 29
        1. Illustration Using The CYP2D6 Gene 31
        2. Illustration With ADRB2 Gene 54
    D. IMPROVED METHODS 60
        1. Improved Method For Finding Optimal Genotyping Sites 60
        2. Improved Methods For Correlating Haplotypes With Clinical
           Outcome Variable(s) 64
            a. Multi-SNP Analysis Method (Build-Up Process) 64
            b. Reverse SNP Analysis Method (Pare-Down Process) 67
    E. TOOLS OF THE INVENTION 70
    F. DATA/DATABASE MODEL 71
        1. Database Model Version 1 72
            a. Submodels 72
            b. Abbreviations 73
            c. Tables 74
            d. Fields 77
        2. Database Model Version 2 100
            a. Submodels 100
            b. Abbreviations 107
            c. Tables 108
            d. Fields 111
    G. BUSINESS MODELS 128
        1. Hap2000 Partnership 128
            a. Partnership Benefits 129
                i. Isogenomics™ Database 129
                ii. Informatics Computer Program 130
                iii. Cohort Haplotyping 130
                iv. Isogene Clones 131
                v. Gene Selection by Partners 131
                vi. Patent Dossier 131
                vii. Committed Liaison 131
                viii. Special Services: cDNAs and Genomic Intervals 131
            b. Membership in the Partnership 132
            c. Envisioned Outcomes From The Partnership 132
        2. Mednostics Program 138
    H. EXAMPLE 1 139
    I. EXAMPLE 2 142
        1. Provision Of Clinical Data 142
        2. Determination Of ADBR2 Genotypes And Haplotypes 143
        3. Correlation Of ADBR2 Haplotypes And Haplotype Pairs
           With Drug Response 143
            a. Build-Up Procedure (To 4 SNP Limit) 143
            b. Pare-Down Procedure (To 10 SNP Limit) 145
            c. Cost Reduction 146
    J. REFERENCES 147
VIII. ABSTRACT OF THE INVENTION 212






CLAIMS 

We claim: 

1. A method of generating a haplotype database for a population, comprising
data elements representative of the haplotypes for at least one locus from the
individuals in the population, the method comprising:

(a) for each individual in the population, generating polymorphism
and haplotype data elements representative of the individual's
polymorphisms and haplotypes for the locus; and

(b) storing the polymorphism and haplotype data
elements for the individuals in a computer-readable database,
wherein the data elements are organized according to the spatial
relationships between the polymorphisms and haplotypes and a
reference nucleotide sequence for the locus.

2. The method of claim 1, wherein the locus is a gene or a gene feature and
the haplotype data elements represent haplotypes and haplotype pairs for the gene or
the gene feature.

3. The method of claim 2, wherein the deriving step comprises ascertaining 
the frequency of the haplotypes and haplotype pairs according to the Hardy- 
Weinberg equilibrium. 

4. The method of claim 2, further comprising deriving the haplotype data
elements by:

(a) determining a nucleotide sequence of the gene or the gene
feature from a first chromosome and a second chromosome in
each individual in the population to generate a plurality of
nucleotide sequences for the population;

(c) aligning the plurality of nucleotide sequences for the
population;

(d) identifying haplotypes from the aligned sequences; and

(e) selecting two haplotypes for each individual as a haplotype pair
for storage in a table in the database.

5. The method of claim 4, wherein the method further comprises validating 
the haplotype data. 


6. The method of claim 5, wherein the validating comprises correcting an 
observed distribution of haplotypes or haplotype pairs for effects imposed by a 
limited number of individuals in the population. 

7. The method of claim 6, wherein the validating also comprises analyzing 
compliance of the observed distribution with Mendelian inheritance principles. 

8. The method of claim 1, wherein the population is selected from the group
consisting of a reference population, a clinical population, a disease population, an 
ethnic population, a family population and a same-sex population. 

9. A method of predicting the presence of a haplotype pair in an individual
comprising:

(a) identifying a genotype for the individual;

(b) enumerating all possible haplotype pairs which are consistent
with the genotype;

(c) accessing a database containing reference haplotype pair
frequency data to determine a probability, for each of the
possible haplotype pairs, that the individual has a possible
haplotype pair; and

(d) analyzing the determined probabilities to predict haplotype
pairs for the individual.

10. The method of claim 9, wherein the identifying step comprises
determining the most predictive genotyping site or sites.

11. The method of claim 10, wherein the determining includes calculating
phylogenetic and/or linkage information for the reference haplotype pairs.

12. The method of claim 10, wherein the enumerating step comprises listing
the possible haplotype pairs in order of their frequency in the database.

13. A method for identifying a correlation between a haplotype pair and a 
clinical response to a treatment, or other phenotype, comprising: 


(a) accessing a database containing data on clinical responses to 
treatments, or other phenotypes, exhibited by a clinical 
population; 

(b) selecting a candidate locus hypothesized to be associated with 
the clinical response or other phenotype, the locus comprising 
at least two polymorphic sites; 

(c) providing haplotype data for each member of the clinical 
population, the haplotype data comprising information on a 
plurality of polymorphic sites present in the candidate locus; 

(d) storing the haplotype data; and 

(e) calculating the degree of correlation between haplotype pairs 
and the clinical response to a treatment, or other phenotype, by 
statistically analyzing the haplotype and clinical response data. 

14. The method of claim 13 wherein step (e) is performed last. 

15. The method of claim 13 wherein step (a) is performed before any one of
steps (b), (c) or (d).

16. The method of claim 13 wherein step (a) is performed after steps (b), (c)
and (d).

17. The method of any one of claims 13-16, wherein the treatment comprises
administration of a drug or drug candidate.

18. The method of claim 17, wherein the candidate locus is a gene or a gene
feature.

19. The method of claim 18, further comprising displaying or outputting the
correlation.

20. The method of claim 19, further comprising calculating the statistical
significance of the correlation.

21. The method of claim 20, wherein the providing haplotype data step
comprises

(a) providing a genotype for the individual; 

(b) enumerating all possible haplotype pairs which are consistent 
with the genotype; 

(c) determining a probability for each possible haplotype pair that 
the individual has that possible haplotype pair, by accessing a 
database containing frequency data for haplotype pairs in a 
reference population; and 

(d) analyzing the determined probabilities to infer the individual's 
haplotype pair. 

22. A method for identifying a correlation between a haplotype pair and 
susceptibility to a condition or disease of interest, or other phenotype of interest, 
comprising the steps of: 

(a) selecting a candidate locus hypothesized to be associated with 
the phenotype, condition or disease of interest, the locus 
comprising at least two polymorphic sites; 

(b) providing haplotype data for the candidate locus for each 
member of a population having the phenotype, condition or 
disease of interest ("disease haplotype data"); 

(c) organizing the disease haplotype data in a database; 

(d) statistically analyzing the disease haplotype data to calculate 
haplotype pair frequencies; 

(e) accessing a database containing haplotype data for the
candidate locus for each member of a healthy reference
population ("reference haplotype data");

(f) statistically analyzing the reference haplotype data to calculate
haplotype pair frequencies; and

(g) when a haplotype pair has a higher frequency in the population 
having the phenotype, condition or disease of interest than in 
the healthy reference population, identifying a correlation of the 
haplotype pair with susceptibility to the disease or condition of 
interest. 

23. The method of claim 22 wherein step (f) is performed after step (d). 

24. The method of claim 22 wherein step (e) is performed before any one of
steps (b), (c), or (d).

25. The method of claim 22 wherein step (e) is performed after any one of 
steps (b), (c), or (d). 

26. The method of any one of claims 22-25, wherein the candidate locus is a 
gene or a gene feature. 

27. The method of claim 26, further comprising displaying or outputting the 
identified correlation. 

28. The method of claim 27, further comprising calculating the statistical 
significance of the identified correlation. 

29. The method of claim 28, wherein the providing haplotype data step 
comprises: 

(a) providing a genotype for the individual; 

(b) enumerating all possible haplotype pairs which are consistent 
with the genotype; 

(c) for each possible haplotype pair, determining the probability
that the individual has that haplotype pair, by accessing a
database containing frequency data for haplotype pairs in a
reference population; and

(d) inferring the individual's haplotype pair based on the
determined probabilities.

30. A method of predicting an individual's response to a medical or 
pharmaceutical treatment, comprising: 

(a) selecting at least one candidate gene for which a correlation 
between haplotype content and response to the treatment has 
been identified; 

(b) determining the haplotype pair of the individual for the 
candidate gene or genes; and 

(c) predicting that the individual's response will be the response 
associated haplotype pair with information on the correlation. 

31. The method of claim 30, wherein the selecting step comprises outputting a
list of candidate genes associated with different responses to the treatment. 

32. The method of claim 31, further comprising storing the haplotype pair. 

33. The method of claim 32, further including generating an error estimate. 

34. A computer implemented method for generating a gene structure screen
for display on a display device, comprising the steps of:

(a) retrieving from a database and displaying in a first area data
indicative of the frequencies of occurrence of a gene's
haplotypes within predetermined member groupings of a
reference population;

(b) retrieving from a database and displaying in a second area data
indicative of the frequencies of occurrence of particular
nucleotides for the member groupings;

(c) retrieving from a database data indicative of gene structure;

(d) displaying in a third area a graphical representation of gene
structure that identifies polymorphic sites on the gene;

(e) selecting one of the polymorphic sites to cause the appropriate
nucleotide frequencies to be displayed in the second area.

35. A computer implemented method for generating a haplotype pair 
frequency screen for display on a display device, comprising the steps of: 


(a) displaying in a first area a plurality of selectable items each 
corresponding to a polymorphic site for a predetermined gene; 

(b) selecting one or more of said selectable items; 

(c) displaying in a second area the haplotype pairs occurring in a 
reference population for the selected polymorphic sites; 

(d) displaying in a third area data indicative of haplotype 
frequencies for a plurality of member groupings within the 
population. 

36. A computer implemented method for generating a linkage screen for
display on a display device, comprising the steps of:

(a) displaying in a first area a graphical scale showing a reference
for determining progressive degrees of linkage between
polymorphic sites in a population;

(b) displaying in a second area a graphical matrix structure having
a plurality of grids, where each axis of the structure represents
polymorphic sites on a gene; and where each grid graphically
displays an indication of degree of linkage between
polymorphic sites corresponding to that grid, in accordance
with the reference shown in the first area.






37. The method of claim 36, wherein color is used as the indication of degree 
of linkage. 

38. A computer implemented method for generating a phylogenetic tree screen
for display on a display device, comprising the steps of:

(a) displaying in a first area a plurality of selectable items each
corresponding to a polymorphic site for a predetermined gene;

(b) selecting one or more of said selectable items;

(c) displaying in a second area a phylogenetic tree structure having
nodes for each haplotype in a population, where the distance
between nodes is indicative of the number of nucleotides that
would have to be flipped to change one haplotype into another.

39. The method of claim 38, wherein the nodes are connected by links that 
indicate a single nucleotide difference between nodes. 

40. The method of claim 39, wherein the nodes each display an indication of 
ethnogeographic frequency of occurrence of the haplotype represented by the node.
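
As a minimal sketch of the distance underlying claims 38 and 39, the number of nucleotides that must be flipped to turn one haplotype into another is the Hamming distance, and links join nodes at distance one. The helper names `hamming` and `tree_links` are hypothetical, and haplotypes are assumed to be equal-length strings.

```python
from itertools import combinations


def hamming(hap1, hap2):
    """Number of sites at which two haplotypes differ."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same sites")
    return sum(a != b for a, b in zip(hap1, hap2))


def tree_links(haplotypes):
    """Pairs of distinct haplotypes separated by a single nucleotide difference."""
    unique = sorted(set(haplotypes))
    return [(a, b) for a, b in combinations(unique, 2) if hamming(a, b) == 1]
```

For the three haplotypes `ACG`, `ACT` and `GCT`, this links `ACG` to `ACT` and `ACT` to `GCT`, while `ACG` and `GCT` remain two steps apart.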

41. A computer implemented method for generating a genotype analysis
screen for display on a display device, comprising the steps of: 

(a) displaying a first plurality of selectable items each 
corresponding to a polymorphic site, and a plurality of second 
selectable items each corresponding to a polymorphic site; 

(b) displaying a graphical scale showing a reference for 
determining progressive degrees of haplotype identification 
reliability using genotyping; 

(c) displaying a graphical matrix structure having a plurality of 
grids, where each axis represents a haplotype indicated by the 
first selectable items; and where each grid graphically displays 
an indication of degree of identification reliability for 
identifying the haplotype corresponding to that grid using 
genotyping specified by the second selectable items, in 
accordance with the reference. 

42. The method of claim 41, wherein the indication of degree is color.

43. A method of displaying clinical response values of a subject population as 
a function of haplotype pairs of the individuals in the population, comprising:

(a) receiving from a computer-readable storage device, data
representing haplotype pairs and clinical response values for 
the subject population; 

(b) graphically displaying a haplotype pair matrix each of whose
cells contains a graphical representation of the clinical response
values of individuals having the haplotype pair corresponding
to that cell of the haplotype pair matrix.

44. A method of displaying clinical response values of a subject population as 
a function of haplotype pairs of the individuals in the population, comprising:

(a) displaying one or more first selectable items representing 
polymorphic sites for a predetermined gene, which when 
selected, will generate haplotype pairs; 

(b) displaying a second selectable item representing a clinical
response measurement, which, when selected in conjunction
with the first selectable items, will cause display of a haplotype
pair matrix, each of whose cells contains a graphical
representation of the clinical response values for the selected
clinical measurement of individuals having the haplotype pair
corresponding to that cell of the haplotype pair matrix.

45. The method of claim 43 or 44, wherein the graphical representation of 
clinical response values is a color scale or gray scale, the shade of each cell being
proportional to the mean clinical response value of individuals having the haplotype
pair corresponding to that cell of the haplotype pair matrix.

46. The method of claim 45, further comprising displaying a means for
adjusting the range of mean clinical response values represented by the color scale
or gray scale, wherein adjustment of the range causes the displayed shade of color
or gray of the cells of the haplotype pair matrix to be adjusted accordingly.

47. The method of claim 43 or 44 wherein the graphical representation of data 
is a histogram indicating the distribution of individuals across the range of clinical 
response values.

48. The method of any one of claims 43, 44, or 45 wherein at least one cell
includes a selectable area which, when selected, will cause the display of a 
histogram indicating the distribution of individuals across the range of clinical 
response values. 

49. The method of any one of claims 43, 44 or 45 which further comprises
displaying a selectable item which, when selected, causes the display of the
statistical significance of the correlations between variation at individual 
polymorphic sites and the clinical response values. 

50. The method of claim 43, 44 or 45 which further comprises displaying a
selectable item which, when selected, displays the numerical mean and standard
deviation of clinical response values among individuals having each haplotype pair
in the matrix.

51. The method of claim 43, 44 or 45 which further comprises displaying a
selectable item which, when selected, causes the display of the results of an analysis
of variation calculation to permit determination of whether variation in the clinical 
response values between individuals having different haplotype pairs is statistically 
significant. 
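
The per-cell statistics of claims 45, 50 and 51 can be sketched as follows. This is an illustration under stated assumptions, not the applicants' implementation: records are assumed to be `(haplotype1, haplotype2, response)` tuples, haplotype pairs are treated as unordered, and the names `group_by_pair`, `pair_summary` and `anova_f` are hypothetical. The analysis of variation is shown as a standard one-way ANOVA F statistic.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_by_pair(records):
    """Group response values by unordered haplotype pair (one group per matrix cell)."""
    groups = defaultdict(list)
    for hap1, hap2, value in records:
        groups[tuple(sorted((hap1, hap2)))].append(value)
    return groups


def pair_summary(records):
    """Return {pair: (count, mean, standard deviation)} for each matrix cell."""
    return {
        pair: (len(v), mean(v), stdev(v) if len(v) > 1 else 0.0)
        for pair, v in group_by_pair(records).items()
    }


def anova_f(records):
    """One-way ANOVA F statistic across the haplotype pair groups."""
    groups = list(group_by_pair(records).values())
    values = [x for g in groups for x in g]
    k, n = len(groups), len(values)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more values than groups")
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation")
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

The mean from `pair_summary` is the quantity a color or gray scale would encode; the F statistic would still need to be compared against an F distribution to obtain a significance level.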

52. A computer-implemented method for carrying out a genetic algorithm for 
finding an optimal set of weights to fit a function of polymorphic site data to a
clinical response measurement comprising: 

(a) displaying a variable controller for setting the number of
genetic algorithm generations parameter;

(b) displaying a variable controller for setting the number of agents
parameter;

(c) displaying a variable controller for setting the mutation rate
parameter;

(d) displaying a variable controller for setting the crossover rate
parameter;

(e) displaying one or more selectable items each corresponding to
a polymorphic site of a predetermined gene; and 

(f) displaying a selectable item for initiation of the genetic 
algorithm calculation; 

wherein selection of one or more selectable items corresponding to a polymorphic
site, and selection of the item for initiation of the genetic algorithm calculation,
results in the execution of the genetic algorithm calculation with the parameters set
by the variable controllers, and the display of the residual error of the model as a
function of the number of genetic algorithm generations and a display of the results
of the genetic algorithm calculation showing the optimal weights for each of the
polymorphic sites.
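
The calculation driven by the controls of claim 52 can be sketched as a small genetic algorithm. Everything below is an assumption made for illustration: the model is taken to be a weighted sum of per-site allele counts, the error is the sum of squared residuals, and the selection, crossover and mutation operators are generic textbook choices. The names `residual_error` and `run_ga` are hypothetical.

```python
import random


def residual_error(weights, sites, responses):
    """Sum of squared residuals of the linear model sum(w_i * x_i)."""
    return sum(
        (sum(w * x for w, x in zip(weights, row)) - y) ** 2
        for row, y in zip(sites, responses)
    )


def run_ga(sites, responses, generations=50, agents=20,
           mutation_rate=0.1, crossover_rate=0.7, seed=0):
    """Return (best weights, residual error per generation)."""
    if agents < 2:
        raise ValueError("need at least two agents")
    rng = random.Random(seed)
    n_sites = len(sites[0])

    def error(weights):
        return residual_error(weights, sites, responses)

    population = [[rng.uniform(-1.0, 1.0) for _ in range(n_sites)]
                  for _ in range(agents)]
    history = []
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        survivors = population[: max(2, agents // 2)]
        children = [survivors[0][:]]  # elitism: the best agent is kept unchanged
        while len(children) < agents:
            mother, father = rng.sample(survivors, 2)
            if rng.random() < crossover_rate:
                # uniform crossover: each weight comes from either parent
                child = [rng.choice(genes) for genes in zip(mother, father)]
            else:
                child = mother[:]
            # Gaussian mutation applied independently to each weight
            child = [w + rng.gauss(0.0, 0.5) if rng.random() < mutation_rate else w
                     for w in child]
            children.append(child)
        population = children
    return min(population, key=error), history
```

The `history` list corresponds to the displayed residual error as a function of generation; because the best agent is always carried forward, it never increases.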

53. A computer-implemented method for displaying correlations between
clinical outcome values for a selected population, comprising:

(a) displaying a first plurality of selectable items
corresponding to the clinical outcome variables;

(b) displaying a second plurality of selectable items
corresponding to the clinical outcome variables; and

(c) displaying a scatter plot of data points corresponding
to the individuals in the selected population;

wherein selecting a first item from the first plurality of selectable items
causes each data point to be plotted on the x axis of the scatter plot
according to the value of the corresponding clinical outcome value for the
individual associated with the data point, and wherein selection of a second
item from the second plurality of selectable items causes each data point to
be plotted on the y axis of the scatter plot according to the value of the
corresponding clinical outcome value for the individual associated with the
data point.

54. A method for conducting a clinical trial of a treatment protocol for a
medical condition of interest, comprising:

(a) selecting one or more genes (or other loci) known or expected
to be involved in a particular disease or drug response; 

(b) defining a reference population of healthy individuals with a 
broad and representative genetic background; 

(c) sequencing DNA from each member of the reference
population;

(d) determining the haplotypes for each of the selected genes (or 
other loci) for each member of the reference population; 

(e) determining the frequencies, population distributions and
statistical measures, including confidence limits, for each of the
determined haplotypes;

(f) recruiting a trial population of individuals who have the 
medical condition of interest; 

(g) treating individuals in the trial population according to the 
treatment protocol, and measuring their response to treatment; 

(h) determining the haplotypes for each of the selected genes (or
other loci) for each member of the trial population;

(i) determining the correlations between individual responses to
the treatment and individual haplotype content for each of the
selected genes (or other loci); and

(j) from these correlations, constructing a model that predicts the 
response of an individual to the treatment, given the 
individual's haplotype content. 

55. The method of claim 54, further comprising the step of deriving from the
haplotype distribution found for the reference population a reduced set of
genotyping markers, which allow an individual's haplotypes to be accurately
predicted without conducting a complete molecular haplotype analysis, and using
the reduced set of genotype markers to determine haplotypes in step (h).

56. A method of inferring genotypes of individual subjects for a selected gene
having at least m polymorphic sites, comprising 

(a) providing a database of m-site haplotypes of the selected gene 
from a representative cohort of individuals; 

(b) tabulating the frequency of occurrence for each of the 
haplotypes; 

(c) constructing a list of all genotypes that could result from all 
possible pairs of observed haplotypes; 

(d) calculating the expected frequency of these genotypes
assuming the Hardy-Weinberg equilibrium;

(e) generating a complete set of all possible masks of the same 
length m as the haplotypes, wherein each mask blocks the 
identity of the nucleotides at m-n polymorphic sites and admits 
the identity of nucleotides at the other n sites; 

(f) for each mask, calculating how much ambiguity results from
genotyping with only the n polymorphic sites whose identity is
admitted by the mask;

(g) from among those masks having an acceptable level of
ambiguity, selecting a mask which has the lowest value of n;

(h) genotyping the subjects by measuring only the n polymorphic
sites that are admitted by the selected mask; and

(i) assigning to each subject having a particular n-site haplotype,
the full m-site haplotype of a member of the initial cohort
having the same n-site haplotype.

57. The method of claim 56, wherein the calculation of ambiguity for a mask 
comprises 

(a) identifying all pairs of genotypes that are rendered identical by 
application of the mask;

(b) calculating the geometric mean of the calculated Hardy-
Weinberg frequencies of each pair of genotypes identified in
step (a);

(c) summing all such geometric means for all ambiguous pairs to
obtain an ambiguity score for the mask.

58. The method of either of claims 56 or 57, wherein, if application of the
selected screen causes an ambiguity in that two haplotype pairs A and B exist that
could explain a given genotype, and the Hardy-Weinberg equilibrium predicts
probabilities pA and pB, where pA + pB = 1, the assignment of a haplotype pair is
carried out by a process comprising

(a) selecting a random number between 0 and 1;

(b) if the random number is less than or equal to pA, assigning the
haplotype pair A; and

(c) if the number is greater than pA, assigning the haplotype pair B.
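
The mask selection of claims 56 to 58 can be sketched in a few functions. This is a hedged illustration only: haplotypes are assumed to be equal-length strings with known population frequencies, a genotype is represented as a tuple of sorted allele pairs, a mask is a tuple of booleans (True admits a site), and all function names are hypothetical. The exhaustive enumeration of masks is exponential in m and is shown only for small m.

```python
import random
from itertools import combinations, combinations_with_replacement, product
from math import sqrt


def genotype(hap1, hap2):
    """Unphased genotype: one sorted allele pair per polymorphic site."""
    return tuple(tuple(sorted(pair)) for pair in zip(hap1, hap2))


def genotype_frequencies(hap_freq):
    """Hardy-Weinberg genotype frequencies from all pairs of observed haplotypes."""
    freqs = {}
    for h1, h2 in combinations_with_replacement(sorted(hap_freq), 2):
        p = hap_freq[h1] * hap_freq[h2] * (1 if h1 == h2 else 2)
        g = genotype(h1, h2)
        freqs[g] = freqs.get(g, 0.0) + p
    return freqs


def ambiguity(mask, geno_freq):
    """Sum of geometric means over genotype pairs made identical by the mask."""
    def masked(g):
        return tuple(site for site, keep in zip(g, mask) if keep)
    return sum(
        sqrt(geno_freq[g1] * geno_freq[g2])
        for g1, g2 in combinations(geno_freq, 2)
        if masked(g1) == masked(g2)
    )


def best_mask(hap_freq, max_ambiguity):
    """Mask admitting the fewest sites among those with acceptable ambiguity."""
    m = len(next(iter(hap_freq)))
    geno_freq = genotype_frequencies(hap_freq)
    candidates = [mask for mask in product((False, True), repeat=m)
                  if ambiguity(mask, geno_freq) <= max_ambiguity]
    # ties on the number of admitted sites are broken by tuple order
    return min(candidates, key=lambda mask: (sum(mask), mask))


def assign_pair(pair_a, pair_b, p_a, rng=random):
    """Claim 58: choose pair A with probability p_a, otherwise pair B."""
    return pair_a if rng.random() <= p_a else pair_b
```

With two perfectly correlated sites, for example haplotypes `AC` and `GT` at equal frequency, a mask admitting either single site has an ambiguity score of zero, so one site suffices.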

59. A method of determining polymorphic sites or sub-haplotypes that 
correlate with a clinical response or outcome of interest, comprising:

(a) providing haplotype information, and clinical response or
outcome data (clinical outcome values) from a cohort of
subjects;

(b) statistically analyzing each individual SNP in the haplotype for
the degree to which it correlates with the clinical outcome
values, and generating a numerical measure of the degree of
correlation;

(c) saving for further processing those individual SNPs whose
numerical measure of the degree of correlation with the clinical
outcome values exceeds a first cut-off value;

(d) generating all possible pair-wise combinations of the saved
SNPs so as to provide a set of n-site sub-haplotypes where n =
2;

(e) statistically analyzing each newly generated n-site sub-
haplotype for the degree to which it correlates with the clinical
outcome values and calculating a numerical measure of the
degree of correlation;

(f) saving for further processing those n-site sub-haplotypes whose
numerical measure of the degree of correlation with the clinical
outcome values exceeds the first cut-off value;

(g) generating all possible pair-wise combinations among and
between the saved SNPs and saved sub-haplotypes, to produce
new sub-haplotypes with increased values of n;

(h) repeating steps (e) through (g) until either (i) no new sub-
haplotypes can be generated, or (ii) no further sub-haplotypes
having n less than a pre-selected limit can be generated.

60. The method of claim 59, further comprising the step of displaying those
saved SNPs and sub-haplotypes whose numerical measure of the degree of
correlation with the clinical outcome value exceeds a second cut-off value, wherein
the second cut-off value is greater than the first cut-off value.

61. The method of claim 59, wherein the numerical measure of degree of
correlation is replaced by the p-value for the correlation, and SNPs and sub- 
haplotypes are saved if the p-value is less than a first cut-off value. 

62. The method of claim 61, further comprising the step of displaying those
saved SNPs and sub-haplotypes whose p-value for the correlation with the clinical
outcome value is less than a second cut-off value, wherein the second cut-off value
is less than the first selected value.
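
The bottom-up search of claim 59 can be sketched as below. The claim leaves the statistical measure open, so this sketch assumes one: the absolute Pearson correlation between the number of copies of a sub-haplotype a subject carries (0, 1 or 2) and the outcome value. A sub-haplotype is represented as a frozenset of `(site, allele)` pairs; the names `copies`, `abs_pearson` and `build_up` are hypothetical, and `max_sites` stands in for the pre-selected limit on n.

```python
from itertools import combinations
from math import sqrt


def copies(sub, hap_pair):
    """How many of a subject's two haplotypes carry every (site, allele) in sub."""
    return sum(all(h[site] == allele for site, allele in sub) for h in hap_pair)


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else abs(sxy) / sqrt(sxx * syy)


def build_up(subjects, cutoff, max_sites):
    """subjects: list of ((hap1, hap2), outcome). Returns {sub-haplotype: score}."""
    pairs = [pair for pair, _ in subjects]
    outcomes = [y for _, y in subjects]

    def score(sub):
        return abs_pearson([copies(sub, pair) for pair in pairs], outcomes)

    m = len(pairs[0][0])
    # steps (b)-(c): score every observed single-site allele and keep the good ones
    snps = {frozenset([(i, h[i])]) for pair in pairs for h in pair for i in range(m)}
    saved = {s: r for s, r in ((s, score(s)) for s in snps) if r > cutoff}
    seen = set(snps)
    grew = True
    while grew:
        # steps (d) and (g): pair-wise combinations of everything saved so far
        new = set()
        for a, b in combinations(list(saved), 2):
            merged = a | b
            sites = [site for site, _ in merged]
            if merged in seen or len(merged) > max_sites:
                continue
            if len(sites) != len(set(sites)):  # two alleles at one site: impossible
                continue
            new.add(merged)
        seen |= new
        # steps (e)-(f): score the new sub-haplotypes and keep those above the cut-off
        kept = {s: r for s, r in ((s, score(s)) for s in new) if r > cutoff}
        saved.update(kept)
        grew = bool(kept)
    return saved
```

The variant of claim 61 would replace the correlation score with a p-value and reverse the comparison.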

63. The method of any one of claims 59-62, further comprising the step of
excluding from further processing complex sub-haplotypes which are constructed
from smaller sub-haplotypes, where the smaller sub-haplotypes each have
correlation values that are at least as significant as that of the complex sub-
haplotype.

64. A method of determining polymorphic sites or sub-haplotypes that
correlate with a clinical response or outcome of interest, comprising: 

(a) providing single gene haplotype information for one or more
genes, and clinical response or outcome data, from a cohort of
subjects;

(b) statistically analyzing each single gene haplotype for the degree
to which it correlates with the clinical response or outcome of
interest, and calculating a numerical measure of the degree of
correlation;

(c) saving for further processing those haplotypes whose numerical
measure of the degree of correlation with the clinical response
or outcome of interest exceeds a first selected value;

(d) for each haplotype composed of m polymorphic sites,
generating all possible sub-haplotypes having a single site
masked, so as to provide a set of sub-haplotypes having (m-n)
sites, where n = 1;

(e) statistically analyzing each newly generated sub-haplotype for
the degree to which it correlates with the clinical response or
outcome of interest, and calculating a numerical measure of the
degree of correlation;

(f) saving for further processing those sub-haplotypes whose
numerical measure of the degree of correlation with the clinical
response or outcome of interest exceeds the first selected value;

(g) from the saved sub-haplotypes, generating all possible sub-
haplotypes having one additional site masked;

(h) repeating steps (e) through (g) until either (i) no new sub-
haplotypes have a degree of correlation which exceeds the first
selected value, or (ii) no further sub-haplotypes having more
unmasked sites than a pre-selected limit can be generated.

65. The method of claim 64, further comprising the step of displaying those
saved sub-haplotypes whose numerical measure of the degree of correlation with the 
clinical response or outcome of interest exceeds a second selected value, wherein 
the second selected value is greater than the first selected value. 

66. The method of claim 64, wherein the numerical measure of degree of
correlation is replaced by the p-value for the correlation, and sub-haplotypes are
saved if the p-value is less than a first selected value.

67. The method of claim 66, further comprising the step of displaying those
saved sub-haplotypes whose p-value for the correlation with the clinical response or
outcome of interest is less than a second selected value, wherein the second selected
value is less than the first selected value.

68. The method of any one of claims 64-67, further comprising the step of
excluding from further processing complex sub-haplotypes which are constructed
from smaller sub-haplotypes, where each of the smaller sub-haplotypes has
correlation values that are at least as significant as that of the complex sub-
haplotype.
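
The top-down masking loop of claim 64 can be sketched compactly. As assumptions for illustration, a sub-haplotype is written as a string with `*` at each masked site, the correlation measure is supplied by the caller as a `score` function, and the names `mask_one_more` and `top_down` are hypothetical; `limit` plays the role of the pre-selected limit on unmasked sites.

```python
def mask_one_more(sub):
    """All sub-haplotypes obtained by masking one additional site of sub."""
    return [sub[:i] + "*" + sub[i + 1:] for i, c in enumerate(sub) if c != "*"]


def top_down(haplotypes, score, cutoff, limit):
    """Return every haplotype and sub-haplotype whose score exceeds cutoff.

    Only sub-haplotypes with more than `limit` unmasked sites are generated.
    """
    saved = {h for h in set(haplotypes) if score(h) > cutoff}
    frontier = set(saved)
    while frontier:
        candidates = {
            s
            for parent in frontier
            for s in mask_one_more(parent)
            if len(s) - s.count("*") > limit
        }
        # keep only new sub-haplotypes that pass the cut-off; stop when none do
        frontier = {s for s in candidates - saved if score(s) > cutoff}
        saved |= frontier
    return saved
```

For instance, with a toy score that rewards an `A` at the first site, the haplotype `ACG` yields `A*G` and `AC*` but not `*CG`.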

69. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to adjust observed haplotype pair frequencies
within a population group, said haplotype pair frequencies being stored in a
computer-readable database of haplotype information for a gene or gene feature of
interest, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
access said database and generate all possible haplotype pairs
consistent with the stored genotypes;

(b) computer-readable program code for causing a computer to
calculate the expected frequency of the generated haplotypes and
haplotype pairs according to the Hardy-Weinberg equilibrium, based
upon the observed distribution of haplotypes or haplotype pairs in the
population; and

(c) computer-readable program code for causing a computer to
select the most probable haplotype pair for the individual based on
the observed.

70. The computer-usable medium of claim 69, further comprising computer-
readable program code stored thereon for causing a computer to correct the stored
distribution of haplotypes or haplotype pairs for effects imposed by the presence of
a limited number of individuals in the population.

71. The computer-usable medium of claim 69, further comprising computer-
readable program code stored thereon for causing a computer to validate haplotype
pair assignments by analyzing for compliance of the assigned haplotype pair with
Mendelian inheritance principles.

72. The computer-usable medium of claim 69, wherein the population is
selected from the group consisting of a reference population, a clinical population, a
disease population, an ethnic population, a family population and a same-sex
population.
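
The assignment of a most probable haplotype pair to an unphased genotype, as recited in claims 69 and 73, can be sketched as follows. The sketch assumes that candidate pairs are drawn only from haplotypes already present in the database and that pair probabilities follow Hardy-Weinberg proportions; the names `consistent_pairs` and `most_probable_pair` are hypothetical.

```python
from itertools import combinations_with_replacement


def consistent_pairs(genotype, haplotypes):
    """Pairs of known haplotypes that reproduce the unphased genotype.

    genotype: one pair of alleles per polymorphic site, in any order.
    """
    target = [tuple(sorted(site)) for site in genotype]
    return [
        (h1, h2)
        for h1, h2 in combinations_with_replacement(sorted(haplotypes), 2)
        if [tuple(sorted(pair)) for pair in zip(h1, h2)] == target
    ]


def most_probable_pair(genotype, hap_freq):
    """Return (best pair, {pair: probability}) under Hardy-Weinberg proportions."""
    pairs = consistent_pairs(genotype, hap_freq)
    if not pairs:
        return None, {}
    weights = {
        (h1, h2): hap_freq[h1] * hap_freq[h2] * (1 if h1 == h2 else 2)
        for h1, h2 in pairs
    }
    total = sum(weights.values())
    probs = {pair: w / total for pair, w in weights.items()}
    return max(probs, key=probs.get), probs
```

A subject heterozygous at two sites illustrates the phase ambiguity: both `AC`/`GT` and `AT`/`GC` explain the genotype, and the reference frequencies decide between them.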

73. A computer-usable medium having computer-readable program code
stored thereon, for causing haplotype pair assignments to be made to an individual
member of a population whose genotype information for a gene or gene feature of
interest is stored in a computer-readable form, the computer-readable program code
comprising:

(a) computer-readable program code for causing a computer to
generate all possible haplotype pairs consistent with the stored
genotype;

(b) computer-readable program code for causing a computer to 
access a database containing reference haplotype pair frequency 
data and to determine from the frequency data the probability, 
for each of the possible haplotype pairs, that the individual has 
the possible haplotype pair; and 

(c) computer-readable program code for causing a computer to 
select the most probable haplotype pair for the individual.

74. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to identify a correlation between a clinical
response to a treatment or other phenotype and a haplotype or haplotype pair present
at a candidate locus hypothesized to be associated with the clinical response or other
phenotype, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
access a database containing data on clinical responses to
treatments, or other phenotypes, exhibited by individuals in a
clinical population;

(b) computer-readable program code for causing a computer to 
access a database containing haplotype data for each individual 
of the clinical population, the haplotype data comprising 
information on a plurality of polymorphic sites present at the 
candidate locus; and 

(c) computer-readable program code for causing a computer to 
calculate the degree of correlation between haplotype pairs and 
the clinical response to the treatment or other phenotype, by 
statistical analysis of the haplotype and clinical response data. 

75. The computer-usable medium of claim 74, wherein the treatment 
comprises administration of a drug or drug candidate. 

76. The computer-usable medium of claim 74, wherein the candidate locus is
a gene or a gene feature.

77. The computer-usable medium of claim 74, further comprising computer-
readable program code stored thereon for causing a computer to store, display, or
output the degree of correlation.

78. The computer-usable medium of claim 74, further comprising computer-
readable program code stored thereon for causing a computer to calculate the
statistical significance of the correlation.

79. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to identify a correlation between an
individual's susceptibility to a condition or disease of interest, or other phenotype,
and a haplotype or haplotype pair present at a candidate locus hypothesized to be
associated with susceptibility to the condition or disease of interest, or with a
phenotype of interest, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
access haplotype data for the candidate locus for each member
of a population having the phenotype or condition or disease of
interest ("disease haplotype data");

(b) computer-readable program code for causing a computer to
statistically analyze the disease haplotype data to calculate
haplotype or haplotype pair frequencies;

(c) computer-readable program code for causing a computer to
access a database containing haplotype data for the candidate
locus for each member of a healthy reference population
("reference haplotype data");

(d) computer-readable program code for causing a computer to
statistically analyze the reference haplotype data to calculate
haplotype or haplotype pair frequencies; and

(e) computer-readable program code for causing a computer to
identify a correlation of a haplotype or haplotype pair with
susceptibility to the disease or condition of interest, or with the
phenotype of interest, when the haplotype or haplotype pair has
a higher frequency in the population having the phenotype,
condition or disease of interest than in the reference population.

80. The computer-usable medium of claim 79, wherein the candidate locus is 
a gene or a gene feature.

81. The computer-usable medium of claim 79, further comprising computer-
readable program code stored thereon for causing a computer to store, display, or
output the identified correlation.

82. The computer-usable medium of claim 79, further comprising computer-
readable program code stored thereon for causing a computer to calculate the
statistical significance of the correlation.
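
The frequency comparison of claims 79 and 82 can be sketched with plain counting. As assumptions for illustration, each population is a list of haplotype labels (one entry per chromosome), enrichment is any higher frequency in the disease population, and significance is summarized by a 2x2 chi-square statistic without continuity correction. The function names are hypothetical.

```python
from collections import Counter


def frequencies(haplotypes):
    """Relative frequency of each haplotype label in a population."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}


def enriched(disease, reference):
    """Haplotypes more frequent in the disease population than in the reference.

    Returns {haplotype: (disease frequency, reference frequency)}.
    """
    fd, fr = frequencies(disease), frequencies(reference)
    return {h: (fd[h], fr.get(h, 0.0)) for h in fd if fd[h] > fr.get(h, 0.0)}


def chi_square_2x2(haplotype, disease, reference):
    """Chi-square statistic for carrying vs. not carrying the haplotype."""
    a = disease.count(haplotype)
    b = len(disease) - a
    c = reference.count(haplotype)
    d = len(reference) - c
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

The statistic would be compared against a chi-square distribution with one degree of freedom to obtain a p-value.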

83. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to predict an individual's response to a
medical or pharmaceutical treatment based on one or more selected haplotypes or
haplotype pairs of the individual, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
access a database of correlations between haplotypes or
haplotype pairs and responses to the medical or pharmaceutical
treatment in a reference population;

(b) computer-readable program code for causing a computer to
locate haplotypes or haplotype pairs in the database that match
the selected haplotype pairs of the individual; and

(c) computer-readable program code for causing a computer to
predict that the individual's response will be the response or
responses associated in the database with the selected haplotype
or haplotype pair.

84. The computer-usable medium of claim 83, further comprising computer-
readable program code stored thereon for causing a computer to generate an error
estimate for the prediction.
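
The lookup and error estimate of claims 83 and 84 can be sketched as a dictionary query. The claims do not say how the database stores its correlations or how the error is computed, so the sketch assumes a mapping from sorted haplotype pairs to lists of observed responses and uses the standard error of the mean as the error estimate; `predict_response` is a hypothetical name.

```python
from math import sqrt
from statistics import mean, stdev


def predict_response(pair, database):
    """Return (predicted response, standard error) or None if the pair is unknown.

    database: {(hap1, hap2) with hap1 <= hap2: [observed responses]}.
    """
    values = database.get(tuple(sorted(pair)))
    if not values:
        return None
    # a single observation gives no basis for an error estimate
    error = stdev(values) / sqrt(len(values)) if len(values) > 1 else None
    return mean(values), error
```

The order in which the individual's two haplotypes are given does not matter, since pairs are sorted before the lookup.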

85. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to display a gene's structure and gene
features on a display device, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
retrieve from a database, and display in a first area of the
display device, data indicative of the frequencies of occurrence
of a gene's haplotypes within predetermined member groupings
of a reference population;

(b) computer-readable program code for causing a computer to
retrieve from a database data indicative of the gene's structure
and gene features;

(c) computer-readable program code for causing a computer to
display in a second area of the display device a graphical
representation of the gene's structure, user-selectable items
indicating the location of gene features, and graphical
indicators of the location of polymorphic sites on the gene;

(d) computer-readable program code for causing a computer to
display in a third area of the display device, in response to a
user's selection of an item indicating a gene feature, a graphical
representation of the structure of the gene feature having user-
selectable items indicating the position of polymorphic sites;

(e) computer-readable program code for causing a computer to
retrieve from a database, and display in a third area of the
display device, in response to a user's selection of an item
indicating the position of a polymorphic site, data indicative of
the frequencies within the member groupings of the occurrence
of particular nucleotides at the polymorphic site.

86. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to display on a display device haplotype pair
frequency data within a population of individuals, for a selected gene or gene
feature, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
display on the display device a plurality of selectable items,
each item corresponding to a polymorphic site in the gene or
gene feature;

(c) computer-readable program code for causing a computer to
retrieve from a database and display on the display device, in
response to a user's selection of one or more items indicating
polymorphic sites, individual haplotype pairs in the database
that differ at one or more of the selected polymorphic sites; and

(d) computer-readable program code for causing a computer to
display on the display device data indicative of the frequencies
of the displayed haplotype pairs within one or more member
groupings within the population.

87. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to display on a display device polymorphic
site linkage data for a gene or gene structure of interest, the computer-readable
program code comprising:

(a) computer-readable program code for causing a computer to
display on the display device one or more matrix structures,
wherein the axes of each matrix structure represent the
polymorphic sites in the gene or gene feature of interest, and
wherein each matrix structure corresponds to a different
population or population group; and

(b) computer-readable program code for causing a computer to
display on the display device, in each cell of a matrix structure,
a graphical indication of degree of linkage between the two
polymorphic sites corresponding to the coordinates of the cell
in the matrix.

88. The computer-usable medium of claim 87, wherein color is used as the
graphical indication of degree of linkage, and wherein the medium further
comprises computer-readable program code stored thereon for causing a computer
to display a reference color scale relating color to degree of linkage.

89. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to display on a display device a phylogenetic 
tree, the computer-readable program code comprising: 

(a) computer-readable program code for causing a computer to
display a plurality of selectable items, each corresponding to a
polymorphic site in the gene or gene feature of interest; and

(b) computer-readable program code for causing a computer to
display a phylogenetic tree structure having a node for each
haplotype in a population, where the distance between nodes is
proportional to the minimum number of nucleotides that would
have to be changed to interconvert the corresponding
haplotypes.

90. The computer-usable medium of claim 89, further comprising computer-
readable program code stored thereon for causing a computer to display connections
between the nodes that indicate a single nucleotide difference between the
haplotypes represented by the nodes.

91. The computer-usable medium of claim 89, further comprising computer-
readable program code stored thereon for causing a computer to display at each
node an indication of the relative frequency of occurrence of the haplotype
represented by the node among different population groups.

92. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to display a genotype analysis screen on a
display device, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
display a first plurality of selectable items, each corresponding
to a polymorphic site, and a second plurality of selectable
items, each corresponding to a polymorphic site;

(b) computer-readable program code for causing a computer to
display on the display device a matrix structure, wherein the
axes of the matrix structure represent haplotypes in the gene or
gene feature of interest that vary at the polymorphic sites
selected from the first plurality of selectable items; and

(c) computer-readable program code for causing a computer to
display on the display device, in each cell of the matrix
structure, a graphical indication of the reliability of the
assignment to an individual of the haplotype pair corresponding
to the coordinates of the cell in the matrix, when the individual
is genotyped only at the polymorphic sites selected from the
second plurality of selectable items.

93. The computer-usable medium of claim 92, wherein color is used as the 
graphical indication of reliability of haplotype pair assignment, and wherein the 
medium further comprises computer-readable program code stored thereon for 
causing a computer to display a reference color scale relating color to reliability of 
haplotype pair assignment.

94. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to display clinical response values, or other 
phenotype data, of a subject population as a function of haplotype pairs of the 
individuals in the population, the computer-readable program code comprising: 

(a) computer-readable program code for causing a computer to 
retrieve from a computer-readable storage device, data 
representing haplotype pairs and clinical response values, or 
other phenotype data, for the subject population; and 

(b) computer-readable program code for causing a computer to 
graphically display a haplotype pair matrix structure, each of 
whose cells contains a graphical representation of the clinical 
response values or other phenotype data of individuals having 
the haplotype pair corresponding to the coordinates of that cell 
in the haplotype pair matrix.

95. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to display on a display device clinical
response values, or other phenotypic data, of a subject population as a function of the
haplotype pairs of the individuals in the population for a gene or gene feature of 
interest, the computer-readable program code comprising: 

(a) computer-readable program code for causing a computer to
display one or more first selectable items representing
polymorphic sites of the gene or gene feature;

(b) computer-readable program code for causing a computer to
display one or more second selectable items representing
clinical measurements or phenotypes; and

(c) computer-readable program code for causing a computer to
display on the display device, in response to the selection by
the user of at least one first and second selectable items, a
haplotype pair matrix structure, wherein the axes of the matrix
structure represent haplotypes in the gene or gene feature of
interest that vary at the polymorphic sites corresponding to the
first selected item or items, and wherein each of the cells of the
matrix contains a graphical representation of the mean clinical
response value, or other phenotype data, for the clinical
measurement represented by the selected second item, of
individuals having the haplotype pair corresponding to the
coordinates of the cell in the haplotype pair matrix.
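The data behind the haplotype pair matrix of claims 94 and 95 can be sketched as follows. This is only an illustration under stated assumptions: the function name `haplotype_pair_matrix`, the record layout `(hap1, hap2, response)`, and the use of `None` for empty cells are hypothetical choices, not taken from the claims, and the graphical rendering itself is omitted.

```python
from collections import defaultdict
from statistics import mean


def haplotype_pair_matrix(records):
    """Build a matrix of mean clinical response values per haplotype pair.

    records: iterable of (hap1, hap2, response) tuples, one per individual
    (a hypothetical layout chosen for this sketch).
    Returns (labels, matrix), where matrix[i][j] is the mean response of
    individuals carrying the pair (labels[i], labels[j]), or None if no
    individual has that pair.
    """
    groups = defaultdict(list)
    for h1, h2, value in records:
        # A haplotype pair is unordered, so store it under a sorted key.
        groups[tuple(sorted((h1, h2)))].append(value)

    # Both axes of the matrix list the same observed haplotypes.
    labels = sorted({h for pair in groups for h in pair})

    def cell(row, col):
        values = groups.get(tuple(sorted((row, col))))
        return mean(values) if values else None

    return labels, [[cell(r, c) for c in labels] for r in labels]


labels, matrix = haplotype_pair_matrix(
    [("ACG", "ACG", 1.0), ("ACG", "ATG", 2.0), ("ATG", "ACG", 4.0)]
)
```

A color scale such as the one in claim 96 would then map each non-empty cell value to a color.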

96. The computer-usable medium of claim 94 or 95, wherein color is used as
the graphical indication of mean clinical response value, or other phenotype data,
and wherein the medium further comprises computer-readable program code stored
thereon for causing a computer to display a reference color scale relating color to
mean clinical response value.

97. The computer-usable medium of claim 96, wherein the medium further 
comprises:

(a) computer-readable program code stored thereon for causing a
computer to display a means for adjusting the range of mean
clinical response values or other phenotype data represented by
the reference color scale; and

(b) computer-readable program code stored thereon for causing a
computer, in response to the adjustment of the range of clinical
response values or other phenotype data represented by the
reference color scale, to adjust the color of the cells of the
haplotype pair matrix.

98. The computer-usable medium of claim 94 or 95, wherein the graphical
representation of data is a histogram indicating the distribution of individuals across
the range of clinical response values or other phenotype data.

99. The computer-usable medium of any one of claims 94, 95, or 96, wherein
at least one cell in the displayed matrix includes a selectable area, and wherein the
medium further comprises computer-readable program code stored thereon for
causing a computer to display, for individuals having the haplotype pair represented
by the coordinates of the cell in the matrix, a histogram indicating the distribution of
the individuals across the range of clinical response values.

100. The computer-usable medium of any one of claims 94, 95, or 96,
which further comprises computer-readable program code stored thereon for
causing a computer to display a third selectable item, and computer-readable
program code stored thereon for causing a computer to display, in response to
selection of the third selectable item by the user, the statistical significance of the
correlations between variation at individual polymorphic sites and the clinical
response values.

101. The computer-usable medium of any one of claims 94, 95, or 96,
which further comprises computer-readable program code stored thereon for
causing a computer to display a fourth selectable item, and computer-readable
program code stored thereon for causing a computer to display, in response to
selection of the fourth selectable item by the user, the numerical mean and standard
deviation of clinical response values among individuals having each haplotype pair
in the matrix.

102. The computer-usable medium of any one of claims 94, 95, or 96,
which further comprises computer-readable program code stored thereon for
causing a computer to display a fifth selectable item, and computer-readable
program code stored thereon for causing a computer to display, in response to
selection of the fifth selectable item by the user, the results of an analysis of
variation calculation to permit determination of whether variation in the clinical
response values between individuals having different haplotype pairs is statistically
significant.
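The statistics named in claims 101 and 102 might be computed along the following lines. This sketch assumes that the "analysis of variation calculation" can be read as a one-way analysis of variance across haplotype pair groups; that reading, the function names `summarize` and `one_way_f`, and the dictionary input layout are assumptions of this example. It also assumes at least two groups and some within-group variation.

```python
from statistics import mean, stdev


def summarize(groups):
    """Mean and sample standard deviation of response values per haplotype pair.

    groups: dict mapping a haplotype pair to a list of response values.
    A group with a single individual is given a standard deviation of 0.0.
    """
    return {
        pair: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for pair, values in groups.items()
    }


def one_way_f(groups):
    """One-way ANOVA F statistic across the haplotype pair groups."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of individuals around their own group mean.
    ss_within = sum(
        (v - mean(values)) ** 2 for values in groups.values() for v in values
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


example = {("H1", "H1"): [1.0, 2.0, 3.0], ("H1", "H2"): [2.0, 3.0, 4.0]}
stats = summarize(example)
f_value = one_way_f(example)
```

The F statistic would then be compared against an F distribution with (k - 1, n - k) degrees of freedom to judge significance.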

103. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to carry out a genetic algorithm for finding
an optimal set of weights to fit a function of polymorphic site data for a gene or
gene feature of interest to a clinical response measurement, the computer-readable
program code comprising:

(a) computer-readable program code for causing a computer to 
display a variable controller for setting the number of genetic 
algorithm generations parameter; 

(b) computer-readable program code for causing a computer to 
display a variable controller for setting the number of agents 
parameter; 

(c) computer-readable program code for causing a computer to 
display a variable controller for setting the mutation rate 
parameter; 

(d) computer-readable program code for causing a computer to 
display a variable controller for setting the crossover rate 
parameter; 

(e) computer-readable program code for causing a computer to
display one or more selectable items each corresponding to a
polymorphic site of the gene or gene feature of interest; and

(f) computer-readable program code for causing a computer to
display a selectable item for initiation of the genetic
algorithm calculation; and

(g) computer-readable program code for causing a computer, in 
response to the selection by the user of one or more selectable 
items corresponding to a polymorphic site, and selection by the 
user of the item for initiation of the genetic algorithm 
calculation, to execute the genetic algorithm calculation with
the parameters set by the variable controllers, and to display on 
a display device (i) the residual error of the model as a function 
of the number of genetic algorithm generations, and (ii) the 
results of the genetic algorithm calculation showing the optimal 
weights for each of the polymorphic sites. 
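The calculation that claim 103 exposes through its controllers can be illustrated with a small genetic algorithm. This is a sketch, not the claimed implementation: the linear model (a weighted sum of per-site codes), the 0/1/2 site coding, the elitist selection scheme, the Gaussian mutation step, and the function name `fit_weights` are all assumptions made for this example. The four parameters correspond to the generations, agents, mutation rate and crossover rate controllers of elements (a) through (d).

```python
import random


def fit_weights(sites, responses, generations=50, agents=30,
                mutation_rate=0.1, crossover_rate=0.7, seed=0):
    """Search for per-site weights that fit responses with a weighted sum.

    sites: one list of numeric site codes per individual (assumed coding).
    responses: one clinical response value per individual.
    Returns (best_weights, history), where history holds the residual
    error of the best agent at each generation.
    """
    rng = random.Random(seed)
    n_sites = len(sites[0])

    def error(weights):
        # Mean squared residual of the assumed linear model.
        return sum(
            (sum(w * x for w, x in zip(weights, row)) - y) ** 2
            for row, y in zip(sites, responses)
        ) / len(responses)

    population = [[rng.uniform(-1, 1) for _ in range(n_sites)]
                  for _ in range(agents)]
    history = []
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        parents = population[: max(2, agents // 2)]
        children = [population[0][:]]  # keep the best agent unchanged
        while len(children) < agents:
            a, b = rng.sample(parents, 2)
            if rng.random() < crossover_rate:
                # Uniform crossover: each weight comes from either parent.
                child = [rng.choice(pair) for pair in zip(a, b)]
            else:
                child = a[:]
            child = [w + rng.gauss(0, 0.1) if rng.random() < mutation_rate else w
                     for w in child]
            children.append(child)
        population = children
    population.sort(key=error)
    history.append(error(population[0]))
    return population[0], history


example_sites = [[0, 1], [1, 0], [2, 1], [1, 2], [0, 2], [2, 0]]
example_responses = [2 * a - b for a, b in example_sites]
weights, history = fit_weights(example_sites, example_responses,
                               generations=20, agents=10, seed=1)
```

Plotting `history` against the generation index gives the residual error curve of element (g)(i), and `weights` corresponds to the per-site weights of element (g)(ii).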

104. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to display on a display device correlations
between clinical outcome values obtained from selected clinical outcome measures
for a selected population, the computer-readable program code comprising:

(a) computer-readable program code for causing a
computer to display a first plurality of selectable items
corresponding to clinical outcome measurements;

(b) computer-readable program code for causing a
computer to display a second plurality of selectable items
corresponding to clinical outcome measurements; and

(c) computer-readable program code for causing a
computer to display a scatter plot of data points, each data point
corresponding to an individual in the selected population;

(d) computer-readable program code for causing a
computer, in response to selection by the user of an item from
among the first plurality of selectable items, to locate each data
point along the x axis of the scatter plot according to the
clinical outcome value for the associated individual from the
clinical measurement represented by the selected item; and

(e) computer-readable program code for causing the
computer, in response to selection by the user of an item from
among the second plurality of selectable items, to locate each
data point along the y axis of the scatter plot according to the
clinical outcome value for the associated individual from the
clinical measurement represented by the selected item.

105. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to provide information of use in conducting a
clinical trial of a treatment protocol for a medical condition of interest, the
computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
access a database of DNA sequence data for selected genes or
other loci in a reference population of individuals, and to access
a database of (or accept as input) DNA sequence data for
selected genes or other loci in a clinical trial population of
individuals;

(b) computer-readable program code for causing a computer to 
assign to each member of the reference population haplotypes 
for each of the selected genes or other loci; 

(c) computer-readable program code for causing a computer to
calculate the frequencies, population distributions and
statistical measures, including confidence limits, for each of the
assigned haplotypes in the reference population;

(d) computer-readable program code for causing a computer to
assign to each member of a trial population haplotypes for each
of the selected genes or other loci, based upon the frequencies,
population distributions and statistical measures calculated in
the reference population;

(e) computer-readable program code for causing a computer to
determine the correlations between individual responses to
the treatment and individual haplotypes, for each of the selected
genes or other loci;

(f) computer-readable program code for causing a computer to
accept as input an individual's DNA sequence data or
haplotypes for one or more of the selected genes or other loci;
and

(g) computer-readable program code for causing a computer to
display or output the expected response of the individual to the
treatment, based on the determined correlations between
individual responses to the treatment and individual haplotypes.

106. The computer-usable medium of claim 105, which further comprises:

(a) computer-readable program code stored thereon for causing a
computer to derive from the haplotype distribution found for
the reference population a reduced set of genotyping markers,
which allow an individual's haplotypes to be accurately
predicted without conducting a complete molecular haplotype
analysis; and

(b) computer-readable program code stored thereon for causing a
computer to use the reduced set of genotype markers to assign
haplotypes.

107. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to infer genotypes of individual subjects for a
selected gene having at least m polymorphic sites, the computer-readable program
code comprising:

(a) computer-readable program code for causing a computer to
access a database of m-site haplotypes of the selected gene
from a representative cohort of individuals;

(b) computer-readable program code for causing a computer to
tabulate the frequency of occurrence for each of the haplotypes;

(c) computer-readable program code for causing a computer to 
construct a list of all genotypes that could result from all 
possible pairs of observed haplotypes; 

(d) computer-readable program code for causing a computer to
calculate the expected frequency of these genotypes assuming
the Hardy-Weinberg equilibrium;

(e) computer-readable program code for causing a computer to 
generate a complete set of all possible masks of the same length 
m as the haplotypes, wherein each mask blocks the identity of 
the nucleotides at m-n polymorphic sites and admits the identity 
of nucleotides at the other n sites; 

(f) computer-readable program code for causing a computer to
calculate, for each mask, how much ambiguity results from
genotyping with only the n polymorphic sites whose identity is
admitted by the mask;

(g) computer-readable program code for causing a computer to 
output or display on a display device the calculated ambiguity 
for one or more masks. 

108. The computer-usable medium of claim 107, which further comprises 
computer-readable program code stored thereon for causing a computer to calculate 
the level of ambiguity for a mask, the computer-readable program code comprising: 

(a) computer-readable program code for causing a computer to 
identify all pairs of genotypes that are rendered identical by 
application of the mask; 

(b) computer-readable program code for causing a computer to
calculate the geometric mean of the calculated Hardy-Weinberg
frequencies of each pair of genotypes rendered identical by
application of the mask;

(c) computer-readable program code for causing a computer to
sum all such geometric means for all ambiguous pairs to obtain
an ambiguity score for the mask.
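Steps (c) through (f) of claim 107 and the scoring of claim 108 can be sketched as below. The representation is an assumption of this example: haplotypes are equal-length strings, an unphased genotype is a tuple of unordered allele pairs, a mask is a tuple of 0/1 flags (1 admits a site, 0 blocks it), and the function names are hypothetical.

```python
from itertools import combinations, combinations_with_replacement, product
from math import sqrt


def genotype_of(h1, h2):
    """Unphased genotype: one unordered allele pair per polymorphic site."""
    return tuple(tuple(sorted(pair)) for pair in zip(h1, h2))


def expected_genotype_freqs(hap_freqs):
    """Hardy-Weinberg genotype frequencies from all pairs of observed haplotypes."""
    freqs = {}
    for h1, h2 in combinations_with_replacement(sorted(hap_freqs), 2):
        # p^2 for a homozygous pair, 2pq for a heterozygous pair.
        p = hap_freqs[h1] * hap_freqs[h2] * (1 if h1 == h2 else 2)
        g = genotype_of(h1, h2)
        freqs[g] = freqs.get(g, 0.0) + p
    return freqs


def all_masks(m):
    """Every mask of length m; 1 admits a site and 0 blocks it."""
    return list(product((0, 1), repeat=m))


def ambiguity(mask, geno_freqs):
    """Sum of geometric means of frequencies over genotype pairs the mask merges."""
    def masked(genotype):
        return tuple(site for site, keep in zip(genotype, mask) if keep)

    return sum(
        sqrt(geno_freqs[a] * geno_freqs[b])
        for a, b in combinations(geno_freqs, 2)
        if masked(a) == masked(b)
    )


geno_freqs = expected_genotype_freqs({"AC": 0.5, "GT": 0.5})
scores = {mask: ambiguity(mask, geno_freqs) for mask in all_masks(2)}
```

A mask with a low score and few admitted sites identifies a reduced set of sites that can be genotyped with little loss of information.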

109. The computer-usable medium of claim 107 or 108, which further
comprises computer-readable program code stored thereon for causing a computer
to assign a haplotype pair to an individual having an ambiguous genotype, the
computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
calculate, for two haplotype pairs A and B that could explain a
given genotype, the Hardy-Weinberg equilibrium probabilities
pA and pB, where pA + pB = 1;

(b) computer-readable program code for causing a computer to
assign a haplotype pair by a process comprising

(i) selecting a random number between 0 and 1;

(ii) if the random number is less than or equal to pA, assigning
the haplotype pair A; and

(iii) if the number is greater than pA, assigning the haplotype
pair B.
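The random assignment of claim 109 is short enough to show directly. In this sketch the names `normalized_hw` and `assign_pair` are hypothetical, and the `draw` parameter exists only so that the random source can be replaced in a test.

```python
import random


def normalized_hw(freq_a, freq_b):
    """Rescale two Hardy-Weinberg pair frequencies so that pA + pB = 1."""
    total = freq_a + freq_b
    return freq_a / total, freq_b / total


def assign_pair(pair_a, pair_b, p_a, draw=random.random):
    """Assign pair A if a random number in [0, 1) is <= pA, otherwise pair B."""
    return pair_a if draw() <= p_a else pair_b


p_a, p_b = normalized_hw(0.02, 0.06)
choice = assign_pair(("H1", "H4"), ("H2", "H3"), p_a)
```

Over many individuals with the same ambiguous genotype, this assigns the two candidate pairs in proportion to pA and pB rather than always choosing the more probable one.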

110. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to determine polymorphic sites or sub-
haplotypes that correlate with a clinical response or outcome of interest, or other
phenotype, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
access a database containing haplotype information, and clinical
response or outcome data (clinical outcome values) or other
phenotype data, from a cohort of subjects;

(b) computer-readable program code for causing a computer to
statistically analyze each individual SNP in the haplotype for the
degree to which it correlates with the clinical outcome values or
other phenotype data, and generating a numerical measure of the
degree of correlation;

(c) computer-readable program code for causing a computer to store
for further processing those individual SNPs whose numerical
measure of the degree of correlation with the clinical outcome
values or other phenotype data exceeds a first cut-off value;

(d) computer-readable program code for causing a computer to
generate all possible pair-wise combinations of the saved SNPs
so as to provide a set of n-site sub-haplotypes where n = 2;

(e) computer-readable program code for causing a computer to
statistically analyze each newly generated n-site sub-haplotype
for the degree to which it correlates with the clinical outcome
values or other phenotype data, and calculate a numerical
measure of the degree of correlation;

(f) computer-readable program code for causing a computer to store
for further processing those n-site sub-haplotypes whose
numerical measure of the degree of correlation exceeds the first
cut-off value;

(g) computer-readable program code for causing a computer to
generate all possible pair-wise combinations among and between
the saved SNPs and saved sub-haplotypes, to produce new
sub-haplotypes with increased values of n;

(h) computer-readable program code for causing a computer to repeat
steps (e) through (g) until either (i) no new sub-haplotypes can be
generated, or (ii) no further sub-haplotypes having n less than a
pre-selected or user-selected limit can be generated.
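The build-up procedure of claim 110 can be outlined as a loop over sets of site indices. This sketch makes several assumptions: a sub-haplotype is represented only by the set of sites it covers, the statistical analysis is supplied by the caller as a hypothetical `score` function returning a numerical measure of correlation, and the size limit is treated as an inclusive maximum (`max_sites`).

```python
from itertools import combinations


def build_subhaplotypes(n_sites, score, cutoff, max_sites):
    """Grow sub-haplotypes from single SNPs by pair-wise combination.

    score: callable taking a frozenset of site indices and returning a
    numerical measure of correlation (hypothetical, supplied by the caller).
    Returns the set of saved SNPs and sub-haplotypes.
    """
    # Steps (b) and (c): keep single SNPs whose measure exceeds the cut-off.
    saved = {frozenset([i]) for i in range(n_sites)
             if score(frozenset([i])) > cutoff}
    seen = set(saved)
    while True:
        # Steps (d) and (g): combine saved SNPs and sub-haplotypes pair-wise.
        candidates = {a | b for a, b in combinations(saved, 2)}
        new = {c for c in candidates if c not in seen and len(c) <= max_sites}
        if not new:
            # Step (h): nothing new can be generated within the size limit.
            return saved
        seen |= new
        # Steps (e) and (f): analyze the new sub-haplotypes, keep those above cut-off.
        saved |= {c for c in new if score(c) > cutoff}


# Toy measure: only sites 0 and 2, alone or together, correlate.
toy_score = lambda sites: 1.0 if sites <= {0, 2} else 0.0
result = build_subhaplotypes(4, toy_score, cutoff=0.5, max_sites=3)
```

For the p-value variant of claim 112 the comparison direction would be reversed, keeping candidates whose p-value does not exceed the cut-off.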

111. The computer-usable medium of claim 110, which further comprises
computer-readable program code stored thereon for causing a computer to display
those saved SNPs and sub-haplotypes whose numerical measure of the degree of
correlation with the clinical outcome value or other phenotype exceeds a second cut-
off value, wherein the second cut-off value is greater than the first cut-off value.

112. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to determine polymorphic sites or sub-
haplotypes that correlate with a clinical response or outcome of interest, or other
phenotype, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to
access a database containing haplotype information, and clinical
response or outcome data (clinical outcome values) or other
phenotype data, from a cohort of subjects;

(b) computer-readable program code for causing a computer to 
statistically analyze each individual SNP in the haplotype for the 
degree to which it correlates with the clinical outcome values or 
other phenotype data, and calculate the p-value for the degree of 
correlation; 

(c) computer-readable program code for causing a computer to store 
for further processing those individual SNPs whose p-value for 
the degree of correlation does not exceed a first cut-off value; 

(d) computer-readable program code for causing a computer to
generate all possible pair-wise combinations of the saved SNPs
so as to provide a set of n-site sub-haplotypes where n = 2;

(e) computer-readable program code for causing a computer to
statistically analyze each newly generated n-site sub-haplotype
for the degree to which it correlates with the clinical outcome
values or other phenotype data, and calculate the p-value for the
degree of correlation;

(f) computer-readable program code for causing a computer to store
for further processing those n-site sub-haplotypes whose p-value
for the degree of correlation does not exceed the first cut-off
value;

(g) computer-readable program code for causing a computer to
generate all possible pair-wise combinations among and between
the saved SNPs and saved sub-haplotypes, to produce new
sub-haplotypes with increased values of n;

(h) computer-readable program code for causing a computer to repeat
steps (e) through (g) until either (i) no new sub-haplotypes can be
generated, or (ii) no further sub-haplotypes having n less than a
pre-selected or user-selected limit can be generated.

113. The computer-usable medium of claim 110, which further comprises
computer-readable program code stored thereon for causing a computer to display
those saved SNPs and sub-haplotypes whose p-value for the degree of correlation
with the clinical outcome value or other phenotype does not exceed a second cut-off
value, wherein the second cut-off value is less than the first cut-off value.

114. The computer-usable medium of claims 110-113, which further
comprises computer-readable program code stored thereon for causing a computer
to exclude from further processing complex sub-haplotypes which are constructed
from smaller sub-haplotypes, where the smaller sub-haplotypes each have
correlation values that are at least as significant as that of the complex sub-
haplotype.

115. A computer-usable medium having computer-readable program code
stored thereon, for causing a computer to determine polymorphic sites or sub-
haplotypes that correlate with a clinical response or outcome of interest, or other
phenotype of interest, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to 
access a database containing single gene haplotype information 
for one or more genes, and clinical response, outcome data, or 
other phenotype data from a cohort of subjects; 

(b) computer-readable program code for causing a computer to
statistically analyze each single gene haplotype for the degree to
which it correlates with the clinical response, outcome, or
phenotype of interest, and to generate a numerical measure of the
degree of correlation;

(c) computer-readable program code for causing a computer to store
for further processing those haplotypes whose numerical measure
of the degree of correlation exceeds a first cut-off value;

(d) computer-readable program code for causing a computer to
generate, for each haplotype composed of m polymorphic sites,
all possible sub-haplotypes having a single site masked, so as to
provide a set of m-n site sub-haplotypes where n = 1;

(e) computer-readable program code for causing a computer to
statistically analyze each newly generated sub-haplotype for the
degree to which it correlates with the clinical response, outcome,
or phenotype of interest, and calculating a numerical measure of
the degree of correlation;

(f) computer-readable program code for causing a computer to save
for further processing those sub-haplotypes whose numerical
measure of the degree of correlation exceeds the first cut-off
value;

(g) computer-readable program code for causing a computer to
generate, from the saved sub-haplotypes, all possible sub-
haplotypes having one additional site masked;

(h) computer-readable program code for causing a computer to repeat 
steps (e) through (g) until either (i) no new sub-haplotypes have a 
degree of correlation which exceeds the first cut-off value, or (ii) 
no further sub-haplotypes having more unmasked sites than a 
pre-selected limit can be generated. 
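Claim 115 works in the opposite direction from claim 110, starting with the full haplotype and masking one site at a time. The sketch below tracks only the set of unmasked sites; the caller-supplied `score` function, the name `mask_down`, and the reading of the pre-selected limit as an inclusive minimum number of unmasked sites (`min_sites`) are assumptions of this example.

```python
def mask_down(m, score, cutoff, min_sites=1):
    """Pare a full m-site haplotype down by masking one site at a time.

    score: callable taking a frozenset of unmasked site indices and
    returning a numerical measure of correlation (hypothetical).
    Returns every saved haplotype or sub-haplotype as a set of unmasked sites.
    """
    full = frozenset(range(m))
    saved = set()
    # Steps (b) and (c): keep the full haplotype only if it exceeds the cut-off.
    frontier = {full} if score(full) > cutoff else set()
    while frontier:
        saved |= frontier
        # Steps (d) and (g): mask one additional site of each saved sub-haplotype.
        children = {
            parent - {site}
            for parent in frontier
            for site in parent
            if len(parent) - 1 >= min_sites
        }
        # Steps (e) and (f): analyze the new sub-haplotypes, keep those above cut-off.
        frontier = {c for c in children if c not in saved and score(c) > cutoff}
    # Step (h): the loop ends when no new sub-haplotype exceeds the cut-off
    # or the minimum number of unmasked sites has been reached.
    return saved


# Toy measure: a sub-haplotype correlates only if it keeps sites 0 and 1.
toy_score = lambda sites: 1.0 if {0, 1} <= sites else 0.0
kept = mask_down(3, toy_score, cutoff=0.5)
```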

116. The computer-usable medium of claim 115, which further comprises
computer-readable program code stored thereon for causing a computer to display
those saved sub-haplotypes whose numerical measure of the degree of correlation
with the clinical response data, outcome value, or other phenotype data exceeds a
second cut-off value, wherein the second cut-off value is greater than the first cut-
off value.

117. A computer-usable medium having computer-readable program code 
stored thereon, for causing a computer to determine polymorphic sites or sub- 
haplotypes that correlate with a clinical response or outcome of interest, or other 
phenotype of interest, the computer-readable program code comprising: 

(a) computer-readable program code for causing a computer to 
access a database containing single gene haplotype information 
for one or more genes, and clinical response, outcome data, or 
other phenotype data from a cohort of subjects; 

(b) computer-readable program code for causing a computer to 
statistically analyze each single gene haplotype for the degree to 
which it correlates with the clinical response, outcome, or
phenotype of interest, and to calculate the p-value for the degree 
of correlation; 

(c) computer-readable program code for causing a computer to store 
for further processing those haplotypes whose p-value for the 
degree of correlation does not exceed a first cut-off value; 

(d) computer-readable program code for causing a computer to 
generate, for each haplotype composed of m polymorphic sites, 
all possible sub-haplotypes having a single site masked, so as to 
provide a set of m-n site sub-haplotypes where n = 1;

(e) computer-readable program code for causing a computer to 
statistically analyze each newly generated sub-haplotype for the 
degree to which it correlates with the clinical response, outcome, 
or phenotype of interest, and calculating the p-value for the 
degree of correlation; 

(f) computer-readable program code for causing a computer to save
for further processing those sub-haplotypes whose p-value for the
degree of correlation does not exceed the first cut-off value;

(g) computer-readable program code for causing a computer to
generate, from the saved sub-haplotypes, all possible sub-
haplotypes having one additional site masked;

(h) computer-readable program code for causing a computer to repeat
steps (e) through (g) until either (i) no new sub-haplotypes have a
p-value which does not exceed the first cut-off value, or (ii) no further
sub-haplotypes having more unmasked sites than a pre-selected
limit can be generated.

118. The computer-usable medium of claim 117, which further comprises
computer-readable program code stored thereon for causing a computer to display
those saved sub-haplotypes whose p-value for the degree of correlation with the
clinical response, outcome, or phenotype of interest does not exceed a second cut-
off value, wherein the second cut-off value is less than the first cut-off value.

119. The computer-usable medium of claims 115-118, which further
comprises computer-readable program code stored thereon for causing a computer
to exclude from further processing complex sub-haplotypes which are constructed
from smaller sub-haplotypes, where the smaller sub-haplotypes each have
correlation values that are at least as significant as that of the complex sub-
haplotype.

120. A computer programmed to cause haplotype pair assignments to be
made to an individual member of a population whose genotype information for a 
gene or gene feature of interest is stored in a computer-readable form, the computer 
comprising a memory having at least one region for storing computer executable 
program code and a processor for executing the program code stored in memory, 
wherein the program code includes: 

computer-readable program code for causing a computer to generate 
all possible haplotype pairs consistent with the stored genotype; 

computer-readable program code for causing a computer to calculate
the frequency of the haplotypes and haplotype pairs according to the
Hardy-Weinberg equilibrium, based upon the observed distribution
of haplotypes or haplotype pairs in the population; and

computer-readable program code for causing a computer to select the
most probable haplotype pair for the individual.
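The three program code elements of claim 120 can be illustrated as follows. The data representation is an assumption of this sketch: a genotype is a list of unordered allele pairs (one per polymorphic site), haplotypes are strings, and `hap_freqs` is a hypothetical table of observed haplotype frequencies in the population. Haplotypes absent from the table are given a frequency of zero.

```python
from itertools import product


def consistent_pairs(genotype):
    """All haplotype pairs that could give rise to an unphased genotype.

    genotype: list of unordered allele pairs, e.g. [("A", "G"), ("C", "T")].
    """
    pairs = set()
    # At each site the two alleles can sit on either chromosome.
    for phase in product(*[((a, b), (b, a)) for a, b in genotype]):
        h1 = "".join(site[0] for site in phase)
        h2 = "".join(site[1] for site in phase)
        pairs.add(tuple(sorted((h1, h2))))
    return pairs


def most_probable_pair(genotype, hap_freqs):
    """Select the consistent pair with the highest Hardy-Weinberg frequency."""
    def hardy_weinberg(pair):
        h1, h2 = pair
        p = hap_freqs.get(h1, 0.0) * hap_freqs.get(h2, 0.0)
        return p if h1 == h2 else 2 * p

    return max(sorted(consistent_pairs(genotype)), key=hardy_weinberg)


genotype = [("A", "G"), ("C", "T")]
hap_freqs = {"AC": 0.6, "GT": 0.3, "AT": 0.05, "GC": 0.05}
best = most_probable_pair(genotype, hap_freqs)
```

Here the pair (AC, GT) has expected frequency 2 x 0.6 x 0.3 = 0.36, against 2 x 0.05 x 0.05 = 0.005 for (AT, GC), so the former is selected.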

121. The computer of claim 120, wherein the program code further
includes computer-readable program code for causing a computer to correct the 
stored distribution of haplotypes or haplotype pairs for effects imposed by the 
presence of a limited number of individuals in the population. 

122. The computer of claim 120, wherein the program code further
includes computer-readable program code for causing a computer to validate 
haplotype pair assignments by analyzing for compliance of the assigned haplotype 
pair with Mendelian inheritance principles. 
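One simple form that the Mendelian check of claim 122 could take is shown below. It assumes a parent-child trio with haplotype pairs already assigned, and it ignores recombination within the gene and new mutation; both simplifications, and the function name `mendelian_consistent`, are assumptions of this sketch rather than details given in the claim.

```python
def mendelian_consistent(child, mother, father):
    """Check that a child's haplotype pair has one haplotype from each parent.

    Each argument is a pair (2-tuple) of haplotype labels.
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


ok = mendelian_consistent(("H1", "H3"), mother=("H1", "H2"), father=("H3", "H3"))
bad = mendelian_consistent(("H1", "H1"), mother=("H1", "H2"), father=("H3", "H3"))
```

An assignment that fails this check in a family population would be flagged for review rather than accepted.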

123. The computer of claim 120, wherein the population is selected from 
the group consisting of a reference population, a clinical population, a disease 
population, an ethnic population, a family population and a same-sex population. 

124. A computer programmed to cause haplotype pair assignments to be
made to an individual member of a population whose genotype information for a 
gene or gene feature of interest is stored in a computer-readable form, the computer 
comprising a memory having at least one region for storing computer executable 
program code and a processor for executing the program code stored in memory, 
wherein the program code includes: 

computer-readable program code for causing a computer to generate 
all possible haplotype pairs consistent with the stored genotype; 

computer-readable program code for causing a computer to access a 
database containing reference haplotype pair frequency data and to 
determine from the frequency data the probability, for each of the 
possible haplotype pairs, that the individual has the possible 
haplotype pair; and 

computer-readable program code for causing a computer to select the
most probable haplotype pair for the individual.

125. A computer programmed to identify a correlation between a clinical
response to a treatment or other phenotype and a haplotype or haplotype pair present
at a candidate locus hypothesized to be associated with the clinical response or other
phenotype, the computer comprising a memory having at least one region for
storing computer executable program code and a processor for executing the
program code stored in memory, wherein the program code includes:

(a) computer-readable program code for causing a computer to
access a database containing data on clinical responses to
treatments, or other phenotypes, exhibited by individuals in a
clinical population;

(b) computer-readable program code for causing a computer to
access a database containing haplotype data for each individual
of the clinical population, the haplotype data comprising
information on a plurality of polymorphic sites present at the
candidate locus; and

(c) computer-readable program code for causing a computer to
calculate the degree of correlation between haplotypes or
haplotype pairs and the clinical response to the treatment or
other phenotype, by statistical analysis of the haplotype and
clinical response data.

126. The computer of claim 125, wherein the treatment comprises
administration of a drug or drug candidate.

127. The computer of claim 125, wherein the candidate locus is a gene or
a gene feature.

128. The computer of claim 125, wherein the program code further
includes computer-readable program code for causing a computer to store, display,
or output the degree of correlation.

129. The computer of claim 125, wherein the program code further 

2^ includes computer-readable program code for causing a computer to calculate the 
statistical significance of the correlation. 
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130. A computer programmed to identify a correlation between an 
individual's susceptibility to a condition or disease of interest, or other phenotype, 
and a haplotype or haplotype pair present at a candidate locus hypothesized to be 
associated with susceptibility to the condition or disease of interest, or with a 
phenotype of interest, the computer comprising a memory having at least one region 
for storing computer executable program code and a processor for executing the 
program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
access haplotype data for the candidate locus for each member 
of a population having the phenotype or condition or disease of 
interest ("disease haplotype data"); 

(b) computer-readable program code for causing a computer to 
statistically analyze the disease haplotype data to calculate 
haplotype or haplotype pair frequencies; 

(c) computer-readable program code for causing a computer to 
access a database containing haplotype data for the candidate 
locus for each member of a healthy reference population 
("reference haplotype data"); 

(d) computer-readable program code for causing a computer to 
statistically analyze the reference haplotype data to calculate 
haplotype or haplotype pair frequencies; and 

(e) computer-readable program code for causing a computer to 
identify a correlation of a haplotype or haplotype pair with 
susceptibility to the disease or condition of interest, or with the 
phenotype of interest, when the haplotype or haplotype pair has 
a higher frequency in the population having the phenotype, 
condition or disease of interest than in the reference population. 

131. The computer of claim 130, wherein the candidate locus is a gene or 
a gene feature. 

132. The computer of claim 130, wherein the program code further 
includes computer-readable program code for causing a computer to store, display, 
or output the identified correlation. 

133. The computer of claim 130, wherein the program code further 
includes computer-readable program code for causing a computer to calculate the 
statistical significance of the correlation. 
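
The frequency comparison of claim 130, steps (b), (d), and (e), can be sketched as follows. This is a minimal illustration assuming that each individual is recorded as a pair of haplotype strings; the function names and the sample data are hypothetical, and no significance test is applied here.

```python
from collections import Counter


def haplotype_frequencies(haplotype_pairs):
    """Relative frequency of each haplotype across all chromosomes."""
    counts = Counter(h for pair in haplotype_pairs for h in pair)
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}


def enriched_haplotypes(disease_pairs, reference_pairs):
    """Haplotypes more frequent in the disease group than in the reference.

    Returns {haplotype: (disease frequency, reference frequency)}.
    """
    disease = haplotype_frequencies(disease_pairs)
    reference = haplotype_frequencies(reference_pairs)
    return {
        h: (f, reference.get(h, 0.0))
        for h, f in disease.items()
        if f > reference.get(h, 0.0)
    }


# Hypothetical disease and reference populations (two individuals each).
disease_pairs = [("ACG", "ACG"), ("ACG", "TCG")]
reference_pairs = [("ACG", "TCG"), ("TCG", "TCG")]
flagged = enriched_haplotypes(disease_pairs, reference_pairs)
```

With real data, a higher frequency alone would normally be followed by the significance calculation of claim 133 before a correlation is reported.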

134. A computer programmed to predict an individual's response to a 
medical or pharmaceutical treatment based on one or more selected haplotypes or 
haplotype pairs of the individual, the computer comprising a memory having at least 
one region for storing computer executable program code and a processor for 
executing the program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
access a database of correlations between haplotypes or 
haplotype pairs and responses to the medical or pharmaceutical 
treatment in a reference population; 

(b) computer-readable program code for causing a computer to 
locate haplotypes or haplotype pairs in the database that match 
the selected haplotypes or haplotype pairs of the individual; and 

(c) computer-readable program code for causing a computer to 
predict that the individual's response will be the response or 
responses associated in the database with the selected haplotype 
or haplotype pair. 

135. The computer of claim 134, wherein the program code further 
includes computer-readable program code for causing a computer to generate an 
error estimate for the prediction. 
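
The lookup of claim 134 and the error estimate of claim 135 can be sketched as follows. The claims do not define the database layout or the form of the error estimate, so this sketch assumes a table keyed by haplotype pair that stores a mean response, a standard deviation, and a sample size, and it reports the standard error of the mean. The table contents and all names are hypothetical.

```python
from math import sqrt


def pair_key(pair):
    """Order-independent key for a haplotype pair."""
    return tuple(sorted(pair))


# Hypothetical correlation database: key -> (mean response, std dev, n).
CORRELATIONS = {
    pair_key(("ACG", "ACG")): (30.0, 4.0, 16),
    pair_key(("ACG", "TCG")): (20.0, 6.0, 36),
}


def predict_response(pair, database=CORRELATIONS):
    """Return (predicted response, standard error), or None if no match."""
    record = database.get(pair_key(pair))
    if record is None:
        return None
    mean, std_dev, n = record
    return mean, std_dev / sqrt(n)


prediction = predict_response(("TCG", "ACG"))
missing = predict_response(("TCG", "TCG"))
```

Sorting the pair before lookup means that the order in which an individual's two haplotypes are listed does not affect the match.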

136. A computer programmed to display a gene's structure and gene 
features on a display device, the computer comprising a memory having at least one 
region for storing computer executable program code and a processor for executing 
the program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
retrieve from a database, and display in a first area of the 
display device, data indicative of the frequencies of occurrence 
of a gene's haplotypes within predetermined member groupings 
of a reference population; 

(b) computer-readable program code for causing a computer to 
retrieve from a database data indicative of the gene's structure 
and gene features; 

(c) computer-readable program code for causing a computer to 
display in a second area of the display device a graphical 
representation of the gene's structure, user-selectable items 
indicating the location of gene features, and graphical 
indicators of the location of polymorphic sites on the gene; 

(d) computer-readable program code for causing a computer to 
display in a third area of the display device, in response to a 
user's selection of an item indicating a gene feature, a graphical 
representation of the structure of the gene feature having user- 
selectable items indicating the position of polymorphic sites; 
and 

(e) computer-readable program code for causing a computer to 
retrieve from a database, and display in a third area of the 
display device, in response to a user's selection of an item 
indicating the position of a polymorphic site, data indicative of 
the frequencies within the member groupings of the occurrence 
of particular nucleotides at the polymorphic site. 

137. A computer programmed to display on a display device haplotype 
pair frequency data within a population of individuals, for a selected gene or gene 
feature, the computer comprising a memory having at least one region for storing 
computer executable program code and a processor for executing the program code 
stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
display on the display device a plurality of selectable items, 
each item corresponding to a polymorphic site in the gene or 
gene feature; 

(c) computer-readable program code for causing a computer to 
retrieve from a database and display on the display device, in 
response to a user's selection of one or more items indicating 
polymorphic sites, individual haplotype pairs in the database 
that differ at one or more of the selected polymorphic sites; and 

(d) computer-readable program code for causing a computer to 
display on the display device data indicative of the frequencies 
of the displayed haplotype pairs within one or more member 
groupings within the population. 

138. A computer programmed to display on a display device polymorphic 
site linkage data for a gene or gene structure of interest, the computer comprising a 
memory having at least one region for storing computer executable program code 
and a processor for executing the program code stored in memory, wherein the 
program code includes: 

(a) computer-readable program code for causing a computer to 
display on the display device one or more matrix structures, 
wherein the axes of each matrix structure represent the 
polymorphic sites in the gene or gene feature of interest, and 
wherein each matrix structure corresponds to a different 
population or population group; and 

(b) computer-readable program code for causing a computer to 
display on the display device, in each cell of a matrix structure, 
a graphical indication of degree of linkage between the two 
polymorphic sites corresponding to the coordinates of the cell 
in the matrix. 

139. The computer of claim 138, wherein color is used as the graphical 
indication of degree of linkage, and wherein the medium further comprises 
computer-readable program code for causing a computer to display a reference 
color scale relating color to degree of linkage. 
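
The values behind the linkage matrix of claims 138 and 139 can be sketched as follows. The claims speak only of a "degree of linkage"; this sketch assumes the r-squared linkage disequilibrium statistic between biallelic sites, computed from phased haplotype strings. The function names and sample haplotypes are hypothetical, and the mapping from value to color is left out.

```python
def r_squared(haplotypes, i, j):
    """r^2 linkage disequilibrium between biallelic sites i and j."""
    n = len(haplotypes)
    a = haplotypes[0][i]  # reference allele at site i
    b = haplotypes[0][j]  # reference allele at site j
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic site carries no linkage information
    d = p_ab - p_a * p_b
    return d * d / denom


def linkage_matrix(haplotypes):
    """Square matrix whose cell (i, j) holds r^2 for sites i and j."""
    m = len(haplotypes[0])
    return [[r_squared(haplotypes, i, j) for j in range(m)] for i in range(m)]


# Hypothetical population: sites 0 and 1 always travel together,
# while site 2 varies independently of both.
haplotypes = ["ACA", "ACG", "GTA", "GTG"]
matrix = linkage_matrix(haplotypes)
```

One such matrix would be computed per population or population group, and each cell value would then be mapped onto the reference color scale.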

140. A computer programmed to display on a display device a 
phylogenetic tree, the computer comprising a memory having at least one region for 
storing computer executable program code and a processor for executing the 
program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
display a plurality of selectable items, each corresponding to a 
polymorphic site in the gene or gene feature of interest; and 

(b) computer-readable program code for causing a computer to 
display a phylogenetic tree structure having a node for each 
haplotype in a population, where the distance between nodes is 
proportional to the minimum number of nucleotides that would 
have to be changed to interconvert the corresponding 
haplotypes. 

141. The computer of claim 140, wherein the program code further 
includes computer-readable program code for causing a computer to display 
connections between the nodes that indicate a single nucleotide difference between 
the haplotypes represented by the nodes. 

142. The computer of claim 140, wherein the program code further 
includes computer-readable program code for causing a computer to display at each 
node an indication of the relative frequency of occurrence of the haplotype 
represented by the node among different population groups. 
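
The node distances of claim 140 and the single-difference connections of claim 141 can be sketched as follows, assuming haplotypes are equal-length strings so that the minimum number of nucleotide changes is the Hamming distance. The function names and sample haplotypes are hypothetical, and drawing the tree is not shown.

```python
from itertools import combinations


def hamming(h1, h2):
    """Number of sites at which two equal-length haplotypes differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    return sum(a != b for a, b in zip(h1, h2))


def single_step_edges(haplotypes):
    """Pairs of haplotypes that differ by exactly one nucleotide."""
    return [
        (h1, h2)
        for h1, h2 in combinations(haplotypes, 2)
        if hamming(h1, h2) == 1
    ]


# Hypothetical haplotypes observed in a population.
haplotypes = ["ACG", "ACT", "GCT", "GTT"]
edges = single_step_edges(haplotypes)
```

In this example the four haplotypes form a chain, since each differs from its neighbor at exactly one site.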

143. A computer programmed to display a genotype analysis screen on a 
display device, the computer comprising a memory having at least one region for 
storing computer executable program code and a processor for executing the 
program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
display a first plurality of selectable items, each corresponding 
to a polymorphic site, and a second plurality of selectable 
items, each corresponding to a polymorphic site; 

(b) computer-readable program code for causing a computer to 
display on the display device a matrix structure, wherein the 
axes of the matrix structure represent haplotypes in the gene or 
gene feature of interest that vary at the polymorphic sites 
selected from the first plurality of selectable items; and 

(c) computer-readable program code for causing a computer to 
display on the display device, in each cell of the matrix 
structure, a graphical indication of the reliability of the 
assignment to an individual of the haplotype pair corresponding 
to the coordinates of the cell in the matrix, when the individual 
is genotyped only at the polymorphic sites selected from the 
second plurality of selectable items. 

144. The computer of claim 143, wherein color is used as the graphical 
indication of reliability of haplotype pair assignment, and wherein the 
program code further includes computer-readable program code for causing a 
computer to display a reference color scale relating color to reliability of haplotype 
pair assignment. 

145. A computer programmed to display clinical response values, or other 
phenotype data, of a subject population as a function of haplotype pairs of the 
individuals in the population, the computer comprising a memory having at least 
one region for storing computer executable program code and a processor for 
executing the program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
retrieve from a computer-readable storage device, data 
representing haplotype pairs and clinical response values, or 
other phenotype data, for the subject population; and 

(b) computer-readable program code for causing a computer to 
graphically display a haplotype pair matrix structure, each of 
whose cells contains a graphical representation of the clinical 
response values or other phenotype data of individuals having 
the haplotype pair corresponding to the coordinates of that cell 
in the haplotype pair matrix. 

146. A computer programmed to display on a display device clinical 
response values, or other phenotypic data, of a subject population as a function of the 
haplotype pairs of the individuals in the population for a gene or gene feature of 
interest, the computer comprising a memory having at least one region for storing 
computer executable program code and a processor for executing the program code 
stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
display one or more first selectable items representing 
polymorphic sites of the gene or gene feature; 

(b) computer-readable program code for causing a computer to 
display one or more second selectable items representing 
clinical measurements or phenotypes; and 

(c) computer-readable program code for causing a computer to 
display on the display device, in response to the selection by 
the user of at least one first and second selectable items, a 
haplotype pair matrix structure, wherein the axes of the matrix 
structure represent haplotypes in the gene or gene feature of 
interest that vary at the polymorphic sites corresponding to the 
first selected item or items, and wherein each of the cells of the 
matrix contains a graphical representation of the mean clinical 
response value, or other phenotype data, for the clinical 
measurement represented by the selected second item, of 
individuals having the haplotype pair corresponding to the 
coordinates of the cell in the haplotype pair matrix. 

147. The computer of claim 145 or 146, wherein color is used as the 
graphical indication of mean clinical response value, or other phenotype data, and 
wherein the program code further includes computer-readable program code for 
causing a computer to display a reference color scale relating color to mean clinical 
response value. 

148. The computer of claim 147, wherein the program code further includes: 

(a) computer-readable program code for causing a computer to 
display a means for adjusting the range of mean clinical 
response values or other phenotype data represented by the 
reference color scale; and 

(b) computer-readable program code for causing a computer, in 
response to the adjustment of the range of clinical response 
values or other phenotype data represented by the reference 
color scale, to adjust the color of the cells of the haplotype pair 
matrix. 

149. The computer of claim 145 or 146, wherein the graphical 
representation of data is a histogram indicating the distribution of individuals across 
the range of clinical response values or other phenotype data. 

150. The computer of any one of claims 145, 146, or 147, wherein at least 
one cell in the displayed matrix includes a selectable area, and wherein the program 
code further includes computer-readable program code for causing a computer to 
display, for individuals having the haplotype pair represented by the coordinates of 
the cell in the matrix, a histogram indicating the distribution of the individuals 
across the range of clinical response values. 

151. The computer of any one of claims 145, 146, or 147, wherein the 
program code further includes computer-readable program code for causing a 
computer to display a third selectable item, and computer-readable program code 
for causing a computer to display, in response to selection of the third selectable 
item by the user, the statistical significance of the correlations between variation at 
individual polymorphic sites and the clinical response values. 

152. The computer of any one of claims 145, 146, or 147, wherein the 
program code further includes computer-readable program code for causing a 
computer to display a fourth selectable item, and computer-readable program code 
for causing a computer to display, in response to selection of the fourth selectable 
item by the user, the numerical mean and standard deviation of clinical response 
values among individuals having each haplotype pair in the matrix. 
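
The per-cell statistics of claim 152 can be sketched as follows. This is a minimal illustration that groups subjects by unordered haplotype pair and reports the mean and the sample standard deviation of their clinical response values. The labels "H1" and "H2", the function name, and the data are hypothetical; a cell with a single subject is given a standard deviation of zero as a simplifying assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_pair(subjects):
    """Map each unordered haplotype pair to (mean, sample std dev)."""
    groups = defaultdict(list)
    for pair, value in subjects:
        groups[tuple(sorted(pair))].append(value)
    summary = {}
    for key, values in groups.items():
        # stdev needs at least two values; report 0.0 for a single subject.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[key] = (mean(values), spread)
    return summary


# Hypothetical subjects: (haplotype pair, clinical response value).
subjects = [
    (("H1", "H2"), 10.0),
    (("H2", "H1"), 14.0),
    (("H1", "H1"), 20.0),
]
summary = summarize_by_pair(subjects)
```

Each key of `summary` corresponds to one cell of the haplotype pair matrix, with the two haplotypes giving the row and column coordinates.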

153. The computer of any one of claims 145, 146, or 147, wherein the 
program code further includes computer-readable program code for causing a 
computer to display a fifth selectable item, and computer-readable program code for 
causing a computer to display, in response to selection of the fifth selectable item by 
the user, the results of an analysis of variation calculation to permit determination of 
whether variation in the clinical response values between individuals having 
different haplotype pairs is statistically significant. 
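
The calculation referred to in claim 153 can be sketched as follows, on the assumption that it is a standard one-way analysis of variance with one group per haplotype pair. Only the F statistic is computed; converting it to a p-value requires the F distribution, which is outside this sketch. The function name and data are hypothetical.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way analysis of variance.

    groups: list of lists, one list of response values per haplotype pair.
    Requires at least two groups and some variation within groups.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and spare observations")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical response values for three haplotype pairs.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
```

A large F value indicates that the differences between the haplotype pair means are large relative to the spread within each pair.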

154. A computer programmed to carry out a genetic algorithm for finding 
an optimal set of weights to fit a function of polymorphic site data for a gene or 
gene feature of interest to a clinical response measurement, the computer 
comprising a memory having at least one region for storing computer executable 
program code and a processor for executing the program code stored in memory, 
wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
display a variable controller for setting the number of genetic 
algorithm generations parameter; 

(b) computer-readable program code for causing a computer to 
display a variable controller for setting the number of agents 
parameter; 

(c) computer-readable program code for causing a computer to 
display a variable controller for setting the mutation rate 
parameter; 

(d) computer-readable program code for causing a computer to 
display a variable controller for setting the crossover rate 
parameter; 

(e) computer-readable program code for causing a computer to 
display one or more selectable items each corresponding to a 
polymorphic site of the gene or gene feature of interest; 

(f) computer-readable program code for causing a computer to 
display a selectable item for initiation of the genetic 
algorithm calculation; and 

(g) computer-readable program code for causing a computer, in 
response to the selection by the user of one or more selectable 
items corresponding to a polymorphic site, and selection by the 
user of the item for initiation of the genetic algorithm 
calculation, to execute the genetic algorithm calculation with 
the parameters set by the variable controllers, and to display on 
a display device (i) the residual error of the model as a function 
of the number of genetic algorithm generations, and (ii) the 
results of the genetic algorithm calculation showing the optimal 
weights for each of the polymorphic sites. 
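
The calculation of claim 154(g) can be sketched as follows. The claim fixes the four parameters (generations, agents, mutation rate, crossover rate) but not the model or the genetic operators, so this sketch assumes a linear model in which the response is a weighted sum of allele counts at the selected sites, with elitism, uniform crossover, and Gaussian mutation. All names, operator choices, and data are hypothetical.

```python
import random


def residual_error(weights, samples):
    """Sum of squared errors of the linear model sum(w_i * x_i)."""
    return sum(
        (y - sum(w * x for w, x in zip(weights, xs))) ** 2
        for xs, y in samples
    )


def run_ga(samples, n_sites, generations=50, agents=30,
           mutation_rate=0.1, crossover_rate=0.7, seed=0):
    """Return (best weights, best residual error at each generation)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(n_sites)]
                  for _ in range(agents)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda w: residual_error(w, samples))
        history.append(residual_error(population[0], samples))
        next_gen = [population[0][:]]  # elitism: keep the best agent
        parents = population[: max(2, agents // 2)]
        while len(next_gen) < agents:
            mother, father = rng.sample(parents, 2)
            if rng.random() < crossover_rate:
                # Uniform crossover: take each weight from either parent.
                child = [rng.choice(genes) for genes in zip(mother, father)]
            else:
                child = mother[:]
            # Gaussian mutation applied independently to each weight.
            child = [
                w + rng.gauss(0, 0.5) if rng.random() < mutation_rate else w
                for w in child
            ]
            next_gen.append(child)
        population = next_gen
    best = min(population, key=lambda w: residual_error(w, samples))
    return best, history


# Hypothetical data: allele counts at two sites and a measured response.
samples = [((0, 1), 2.0), ((1, 0), 1.0), ((1, 1), 3.0), ((2, 1), 4.0)]
best_weights, history = run_ga(samples, n_sites=2)
```

Here `history` is the residual error as a function of generation, item (i) of the claim, and `best_weights` holds one weight per polymorphic site, item (ii). Because the best agent is always carried forward, the recorded error never increases from one generation to the next.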

155. A computer programmed to display on a display device correlations 
between clinical outcome values obtained from selected clinical outcome measures 
for a selected population, the computer comprising a memory having at least one 
region for storing computer executable program code and a processor for executing 
the program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a 
computer to display a first plurality of selectable items 
corresponding to clinical outcome measurements; 

(b) computer-readable program code for causing a 
computer to display a second plurality of selectable items 
corresponding to clinical outcome measurements; and 

(c) computer-readable program code for causing a 
computer to display a scatter plot of data points, each data point 
corresponding to an individual in the selected population; 

(d) computer-readable program code for causing a 
computer, in response to selection by the user of an item from 
among the first plurality of selectable items, to locate each data 
point along the x axis of the scatter plot according to the 
clinical outcome value for the associated individual from the 
clinical measurement represented by the selected item; and 

(e) computer-readable program code for causing the 
computer, in response to selection by the user of an item from 
among the second plurality of selectable items, to locate each 
data point along the y axis of the scatter plot according to the 
clinical outcome value for the associated individual from the 
clinical measurement represented by the selected item. 

156. A computer programmed to provide information of use in conducting 
a clinical trial of a treatment protocol for a medical condition of interest, the 
computer comprising a memory having at least one region for storing computer 
executable program code and a processor for executing the program code stored in 
memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
access a database of DNA sequence data for selected genes or 
other loci in a reference population of individuals, and to access 
a database of (or accept as input) DNA sequence data for 
selected genes or other loci in a clinical trial population of 
individuals; 

(b) computer-readable program code for causing a computer to 
assign to each member of the reference population haplotypes 
for each of the selected genes or other loci; 

(c) computer-readable program code for causing a computer to 
calculate the frequencies, population distributions and 
statistical measures, including confidence limits, for each of the 
assigned haplotypes in the reference population; 

(d) computer-readable program code for causing a computer to 
assign to each member of a trial population haplotypes for each 
of the selected genes or other loci, based upon the frequencies, 
population distributions and statistical measures calculated in 
the reference population; 

(e) computer-readable program code for causing a computer to 
determine the correlations between individual responses to 
the treatment and individual haplotypes, for each of the selected 
genes or other loci; 

(f) computer-readable program code for causing a computer to 
accept as input an individual's DNA sequence data or 
haplotypes for one or more of the selected genes or other loci; 
and 

(g) computer-readable program code for causing a computer to 
display or output the expected response of the individual to the 
treatment, based on the determined correlations between 
individual responses to the treatment and individual haplotypes. 

157. The computer of claim 156, wherein the program code further 
includes: 

(a) computer-readable program code for causing a computer to 
derive from the haplotype distribution found for the reference 
population a reduced set of genotyping markers, which allow 
an individual's haplotypes to be accurately predicted without 
conducting a complete molecular haplotype analysis; and 

(b) computer-readable program code for causing a computer to use 
the reduced set of genotype markers to assign haplotypes. 

158. A computer programmed to infer genotypes of individual subjects for 
a selected gene having at least m polymorphic sites, the computer comprising a 
memory having at least one region for storing computer executable program code 
and a processor for executing the program code stored in memory, wherein the 
program code includes: 

(a) computer-readable program code for causing a computer to 
access a database of m-site haplotypes of the selected gene 
from a representative cohort of individuals; 

(b) computer-readable program code for causing a computer to 
tabulate the frequency of occurrence for each of the haplotypes; 

(c) computer-readable program code for causing a computer to 
construct a list of all genotypes that could result from all 
possible pairs of observed haplotypes; 

(d) computer-readable program code for causing a computer to 
calculate the expected frequency of these genotypes assuming 
the Hardy-Weinberg equilibrium; 

(e) computer-readable program code for causing a computer to 
generate a complete set of all possible masks of the same length 
m as the haplotypes, wherein each mask blocks the identity of 
the nucleotides at m-n polymorphic sites and admits the identity 
of nucleotides at the other n sites; 

(f) computer-readable program code for causing a computer to 
calculate, for each mask, how much ambiguity results from 
genotyping with only the n polymorphic sites whose identity is 
admitted by the mask; 

(g) computer-readable program code for causing a computer to 
output or display on a display device the calculated ambiguity 
for one or more masks. 

159. The computer of claim 158, wherein the program code further 
includes computer-readable program code for causing a computer to calculate the 
level of ambiguity for a mask, the computer-readable program code comprising: 

(a) computer-readable program code for causing a computer to 
identify all pairs of genotypes that are rendered identical by 
application of the mask; 

(b) computer-readable program code for causing a computer to 
calculate the geometric mean of the calculated Hardy-Weinberg 
frequencies of each pair of genotypes rendered identical by 
application of the mask; 

(c) computer-readable program code for causing a computer to 
sum all such geometric means for all ambiguous pairs to obtain 
an ambiguity score for the mask. 
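
The steps of claims 158 and 159 can be sketched as follows. This is a minimal illustration that assumes haplotypes are equal-length strings, represents an unphased genotype as an unordered allele pair at each site, and represents a mask as a tuple of booleans in which True admits a site. The function names and the sample haplotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement, product
from math import sqrt


def genotype_of(h1, h2):
    """Unphased genotype: an unordered allele pair at every site."""
    return tuple(tuple(sorted(site)) for site in zip(h1, h2))


def hardy_weinberg_genotypes(haplotypes):
    """Expected genotype frequencies from observed haplotype frequencies."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    freq = {h: c / total for h, c in counts.items()}
    genotypes = Counter()
    for h1, h2 in combinations_with_replacement(sorted(freq), 2):
        weight = 1 if h1 == h2 else 2  # p^2 for homozygotes, 2pq otherwise
        genotypes[genotype_of(h1, h2)] += weight * freq[h1] * freq[h2]
    return dict(genotypes)


def all_masks(m):
    """Every mask of length m; True admits a site, False blocks it."""
    return list(product((True, False), repeat=m))


def apply_mask(genotype, mask):
    """Keep only the sites that the mask admits."""
    return tuple(site for site, admitted in zip(genotype, mask) if admitted)


def ambiguity_score(genotypes, mask):
    """Sum of geometric means over genotype pairs the mask makes identical."""
    score = 0.0
    for (g1, f1), (g2, f2) in combinations(genotypes.items(), 2):
        if apply_mask(g1, mask) == apply_mask(g2, mask):
            score += sqrt(f1 * f2)
    return score


# Hypothetical cohort with two haplotypes at equal frequency.
genotypes = hardy_weinberg_genotypes(["AC", "AC", "GT", "GT"])
scores = {mask: ambiguity_score(genotypes, mask) for mask in all_masks(2)}
```

In this example the two sites are perfectly linked, so genotyping the first site alone (mask `(True, False)`) leaves no ambiguity, while blocking both sites makes every genotype indistinguishable.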

160. The computer of any one of claims 158 or 159, wherein the program 
code further includes computer-readable program code for causing a computer to 
assign a haplotype pair to an individual having an ambiguous genotype, the 
computer-readable program code comprising: 

(a) computer-readable program code for causing a computer to 
calculate, for two haplotype pairs A and B that could explain a 
given genotype, the Hardy-Weinberg equilibrium probabilities 
pA and pB, where pA + pB = 1; 

(b) computer-readable program code for causing a computer to 
assign a haplotype pair by a process comprising 

(i) selecting a random number between 0 and 1; 

(ii) if the random number is less than or equal to pA, assigning 
the haplotype pair A; and 

(iii) if the number is greater than pA, assigning the haplotype 
pair B. 

161. A computer programmed to determine polymorphic sites or 
sub-haplotypes that correlate with a clinical response or outcome of interest, or other 
phenotype, the computer comprising a memory having at least one region for 
storing computer executable program code and a processor for executing the 
program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
access a database containing haplotype information, and clinical 
response or outcome data (clinical outcome values) or other 
phenotype data, from a cohort of subjects; 

(b) computer-readable program code for causing a computer to 
statistically analyze each individual SNP in the haplotype for the 
degree to which it correlates with the clinical outcome values or 
other phenotype data, and generate a numerical measure of the 
degree of correlation; 

(c) computer-readable program code for causing a computer to store 
for further processing those individual SNPs whose numerical 
measure of the degree of correlation with the clinical outcome 
values or other phenotype data exceeds a first cut-off value; 

(d) computer-readable program code for causing a computer to 
generate all possible pair-wise combinations of the saved SNPs 
so as to provide a set of n-site sub-haplotypes where n = 2; 

(e) computer-readable program code for causing a computer to 
statistically analyze each newly generated n-site sub-haplotype 
for the degree to which it correlates with the clinical outcome 
values or other phenotype data, and calculate a numerical 
measure of the degree of correlation; 

(f) computer-readable program code for causing a computer to store 
for further processing those n-site sub-haplotypes whose 
numerical measure of the degree of correlation exceeds the first 
cut-off value; 

(g) computer-readable program code for causing a computer to 
generate all possible pair-wise combinations among and between 
the saved SNPs and saved sub-haplotypes, to produce new 
sub-haplotypes with increased values of n; 

(h) computer-readable program code for causing a computer to repeat 
steps (e) through (g) until either (i) no new sub-haplotypes can be 
generated, or (ii) no further sub-haplotypes having n less than a 
pre-selected or user-selected limit can be generated. 

162. The computer of claim 161, wherein the program code further 
includes computer-readable program code for causing a computer to display those 
saved SNPs and sub-haplotypes whose numerical measure of the degree of 
correlation with the clinical outcome value or other phenotype exceeds a second 
cut-off value, wherein the second cut-off value is greater than the first cut-off value. 

163. A computer programmed to determine polymorphic sites or 
sub-haplotypes that correlate with a clinical response or outcome of interest, or other 
phenotype, the computer comprising a memory having at least one region for 
storing computer executable program code and a processor for executing the 
program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
access a database containing haplotype information, and clinical 
response or outcome data (clinical outcome values) or other 
phenotype data, from a cohort of subjects; 

(b) computer-readable program code for causing a computer to 
statistically analyze each individual SNP in the haplotype for the 
degree to which it correlates with the clinical outcome values or 
other phenotype data, and calculate the p-value for the degree of 
correlation; 

(c) computer-readable program code for causing a computer to store 
for further processing those individual SNPs whose p-value for 
the degree of correlation does not exceed a first cut-off value; 

(d) computer-readable program code for causing a computer to 
generate all possible pair-wise combinations of the saved SNPs 
so as to provide a set of n-site sub-haplotypes where n = 2; 

(e) computer-readable program code for causing a computer to 
statistically analyze each newly generated n-site sub-haplotype 
for the degree to which it correlates with the clinical outcome 
values or other phenotype data, and calculate the p-value for the 
degree of correlation; 

(f) computer-readable program code for causing a computer to store 
for further processing those n-site sub-haplotypes whose p-value 
for the degree of correlation does not exceed the first cut-off 
value; 

(g) computer-readable program code for causing a computer to 
generate all possible pair-wise combinations among and between 
the saved SNPs and saved sub-haplotypes, to produce new 
sub-haplotypes with increased values of n; 

(h) computer-readable program code for causing a computer to repeat 
steps (e) through (g) until either (i) no new sub-haplotypes can be 
generated, or (ii) no further sub-haplotypes having n less than a 
pre-selected or user-selected limit can be generated. 
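
The iterative search shared by claims 161 and 163 can be sketched as follows. The claims leave the statistic open, so this sketch assumes the absolute Pearson correlation between a subject's outcome and the number of that subject's two haplotypes carrying the sub-haplotype, and it keeps candidates whose value exceeds the cut-off, as in claim 161. The p-value variant of claim 163 would instead keep candidates that do not exceed the cut-off. All names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(sxy) / sqrt(sxx * syy)


def score(sub, subjects):
    """sub: frozenset of (site, allele). Correlate copies carried vs outcome."""
    copies = [sum(all(h[s] == a for s, a in sub) for h in pair)
              for pair, _ in subjects]
    return abs_correlation(copies, [y for _, y in subjects])


def consistent(sub):
    """A sub-haplotype may name each site only once."""
    sites = [s for s, _ in sub]
    return len(sites) == len(set(sites))


def search(subjects, cutoff, max_sites):
    """Return {sub-haplotype: score} for every candidate above the cut-off."""
    n_sites = len(subjects[0][0][0])
    snps = {(s, h[s]) for pair, _ in subjects for h in pair
            for s in range(n_sites)}
    frontier = {frozenset([snp]) for snp in snps}  # single SNPs first
    tested, saved = set(), {}
    while frontier:
        tested |= frontier
        for sub in frontier:
            value = score(sub, subjects)
            if value > cutoff:
                saved[sub] = value
        # Combine saved SNPs and sub-haplotypes into larger candidates.
        merged = {a | b for a, b in combinations(saved, 2)}
        frontier = {c for c in merged if c not in tested
                    and consistent(c) and len(c) <= max_sites}
    return saved


# Hypothetical cohort: outcome is 10 per haplotype carrying A at site 0
# together with C at site 1.
subjects = [
    (("ACG", "ACT"), 20.0), (("ACG", "GTT"), 10.0), (("ATG", "GCT"), 0.0),
    (("GTG", "GTT"), 0.0), (("ACT", "ATG"), 10.0), (("GCG", "GTT"), 0.0),
]
saved = search(subjects, cutoff=0.5, max_sites=2)
```

The `tested` set ensures that a candidate rejected once is not regenerated and rescored, so the loop ends when no new sub-haplotypes can be formed or the site limit is reached.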

164. The computer of claim 161, wherein the program code further 
includes computer-readable program code for causing a computer to display those 
saved SNPs and sub-haplotypes whose p-value for the degree of correlation with the 
clinical outcome value or other phenotype does not exceed a second cut-off value, 
wherein the second cut-off value is less than the first cut-off value. 



165. The computer of any one of claims 161-164, wherein the program 
code further includes computer-readable program code for causing a computer to 
exclude from further processing complex sub-haplotypes which are constructed from 
smaller sub-haplotypes, where the smaller sub-haplotypes each have correlation 
values that are at least as significant as that of the complex sub-haplotype. 
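As a non-limiting sketch of the exclusion described in the preceding claim, the fragment below drops a complex sub-haplotype when every smaller sub-haplotype it was constructed from has a p-value no larger than its own. The function name `exclude_redundant` and the `built_from` mapping (recording which saved entries were combined to form each complex sub-haplotype) are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Mapping, Sequence

Sites = FrozenSet[int]  # a sub-haplotype, named by the indices of its polymorphic sites


def exclude_redundant(
    saved: Mapping[Sites, float],                    # sub-haplotype -> p-value
    built_from: Mapping[Sites, Sequence[Sites]],     # complex -> its smaller parts
) -> Dict[Sites, float]:
    """Keep only entries that are more significant than at least one of their parts."""
    kept: Dict[Sites, float] = {}
    for sites, p in saved.items():
        parts = built_from.get(sites, ())
        # A smaller p-value means a more significant correlation.
        if parts and all(part in saved and saved[part] <= p for part in parts):
            continue  # every part is at least as significant: exclude the complex one
        kept[sites] = p
    return kept
```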






166. A computer programmed to determine polymorphic sites or
sub-haplotypes that correlate with a clinical response or outcome of interest, or other
phenotype of interest, the computer comprising a memory having at least one region
for storing computer executable program code and a processor for executing the
program code stored in memory, wherein the program code includes:

(a) computer-readable program code for causing a computer to
access a database containing single gene haplotype information
for one or more genes, and clinical response, outcome data, or
other phenotype data from a cohort of subjects;

(b) computer-readable program code for causing a computer to
statistically analyze each single gene haplotype for the degree to
which it correlates with the clinical response, outcome, or
phenotype of interest, and to generate a numerical measure of the
degree of correlation;

(c) computer-readable program code for causing a computer to store 
for further processing those haplotypes whose numerical measure 
of the degree of correlation exceeds a first cut-off value; 

(d) computer-readable program code for causing a computer to
generate, for each haplotype composed of m polymorphic sites,
all possible sub-haplotypes having a single site masked, so as to
provide a set of m-n site sub-haplotypes where n = 1;

(e) computer-readable program code for causing a computer to
statistically analyze each newly generated sub-haplotype for the
degree to which it correlates with the clinical response, outcome,
or phenotype of interest, and calculate a numerical measure of
the degree of correlation;

(f) computer-readable program code for causing a computer to save
for further processing those sub-haplotypes whose numerical
measure of the degree of correlation exceeds the first cut-off
value;

(g) computer-readable program code for causing a computer to
generate, from the saved sub-haplotypes, all possible
sub-haplotypes having one additional site masked;

(h) computer-readable program code for causing a computer to repeat
steps (e) through (g) until either (i) no new sub-haplotypes have a
degree of correlation which exceeds the first cut-off value, or (ii)
no further sub-haplotypes having more unmasked sites than a
pre-selected limit can be generated.
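The masking procedure of steps (b) through (h) above may be illustrated, purely as a sketch, by the following fragment. It assumes that a haplotype is written as a string with one character per polymorphic site and that a masked site is written `*`; the names `mask_one_more`, `mask_down`, `score` and `limit` are hypothetical, and the numerical measure of correlation is supplied by the caller.

```python
from typing import Callable, Dict, Iterable, Set

MASK = "*"  # assumed symbol for a masked polymorphic site


def mask_one_more(hap: str) -> Set[str]:
    """All sub-haplotypes obtained by masking one additional site."""
    return {hap[:i] + MASK + hap[i + 1:]
            for i, allele in enumerate(hap) if allele != MASK}


def unmasked(hap: str) -> int:
    """Number of sites that are still unmasked."""
    return sum(1 for allele in hap if allele != MASK)


def mask_down(
    haplotypes: Iterable[str],
    score: Callable[[str], float],  # hypothetical numerical measure of correlation
    cutoff: float,                  # the "first cut-off value"
    limit: int,                     # sub-haplotypes must keep more than `limit` unmasked sites
) -> Dict[str, float]:
    """Return every saved haplotype and sub-haplotype with its score."""
    saved: Dict[str, float] = {}
    seen: Set[str] = set()
    frontier = set(haplotypes)
    while frontier:
        seen |= frontier
        passed = set()
        # Steps (b)-(c) and (e)-(f): score each candidate, keep those above the cut-off.
        for hap in frontier:
            value = score(hap)
            if value > cutoff:
                saved[hap] = value
                passed.add(hap)
        # Steps (d) and (g): mask one more site in each saved candidate.
        frontier = {sub for hap in passed for sub in mask_one_more(hap)
                    if unmasked(sub) > limit and sub not in seen}
    return saved
```

The loop ends under either condition of step (h): when no candidate in a round exceeds the cut-off, or when masking a further site would leave no more than `limit` unmasked sites.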

167. The computer of claim 166, wherein the program code further
includes computer-readable program code for causing a computer to display those
saved sub-haplotypes whose numerical measure of the degree of correlation with the
clinical response data, outcome value, or other phenotype data exceeds a second
cut-off value, wherein the second cut-off value is greater than the first cut-off value.

168. A computer programmed to determine polymorphic sites or sub- 
haplotypes that correlate with a clinical response or outcome of interest, or other 
phenotype of interest, the computer comprising a memory having at least one region 
for storing computer executable program code and a processor for executing the 
program code stored in memory, wherein the program code includes: 

(a) computer-readable program code for causing a computer to 
access a database containing single gene haplotype information 
for one or more genes, and clinical response, outcome data, or 
other phenotype data from a cohort of subjects; 

(b) computer-readable program code for causing a computer to 
statistically analyze each single gene haplotype for the degree to 
which it correlates with the clinical response, outcome, or 
phenotype of interest, and to calculate the p-value for the degree 
of correlation; 

(c) computer-readable program code for causing a computer to store
for further processing those haplotypes whose p-value for the
degree of correlation does not exceed a first cut-off value;

(d) computer-readable program code for causing a computer to
generate, for each haplotype composed of m polymorphic sites,
all possible sub-haplotypes having a single site masked, so as to
provide a set of m-n site sub-haplotypes where n = 1;

(e) computer-readable program code for causing a computer to
statistically analyze each newly generated sub-haplotype for the
degree to which it correlates with the clinical response, outcome,
or phenotype of interest, and calculate the p-value for the
degree of correlation;

(f) computer-readable program code for causing a computer to save
for further processing those sub-haplotypes whose p-value for the
degree of correlation does not exceed the first cut-off value;

(g) computer-readable program code for causing a computer to
generate, from the saved sub-haplotypes, all possible
sub-haplotypes having one additional site masked;

(h) computer-readable program code for causing a computer to repeat
steps (e) through (g) until either (i) no new sub-haplotypes have a
p-value which does not exceed the first cut-off value, or (ii) no further
sub-haplotypes having more unmasked sites than a pre-selected
limit can be generated.
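The claims above do not name a particular statistical test for obtaining the p-value. Purely as an assumed example, for a binary clinical response the p-value for a haplotype or sub-haplotype could be taken from a two-sided Fisher's exact test on a 2x2 table of carriers and non-carriers against responders and non-responders. The function name and the table layout below are illustrative choices, not features of the claimed subject matter.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table

                     responder   non-responder
        carrier          a             b
        non-carrier      c             d
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x responders among the carriers.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)
```

A haplotype would then be saved in step (c) or (f) when the returned value does not exceed the first cut-off value.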

169. The computer of claim 168, wherein the program code further
includes computer-readable program code for causing a computer to display those
saved sub-haplotypes whose p-value for the degree of correlation with the clinical
response, outcome, or phenotype of interest does not exceed a second cut-off value,
wherein the second cut-off value is less than the first cut-off value.






170. The computer of any one of claims 166-169, wherein the program
code further includes computer-readable program code for causing a computer to
exclude from further processing complex sub-haplotypes which are constructed
from smaller sub-haplotypes, where the smaller sub-haplotypes each have
correlation values that are at least as significant as that of the complex
sub-haplotype.

171. A data structure for storing and organizing biological information,
stored on a computer-readable medium and accessible by a processor, which
comprises a single parent table which is adapted for storing, organizing, and
retrieving a plurality of genetic features by the relative positional relationships
between the genetic features.

172. The data structure of claim 171, wherein said parent table is part of each
of three submodels comprising the data structure, wherein said submodels are a
genomic repository submodel, a variation repository submodel and a literature
repository submodel.

173. The data structure of claim 172, wherein the genetic features are
selected from the group consisting of chromosomes, genomic regions, genes, gene
regions, gene transcripts, transcript regions, and polymorphisms.

174. The data structure of claim 173, further comprising a clinical repository 
submodel. 

175. The data structure of claim 174, further comprising a drug repository
submodel.
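One possible, non-authoritative reading of the single parent table of claims 171 and 173 is a self-referencing table in which each genetic feature records the feature it lies on and its position relative to that feature. The sketch below uses an in-memory SQLite database; the table name `feature`, its column names, and the helper functions are assumptions made for illustration and are not taken from the specification.

```python
import sqlite3

# Hypothetical schema: one table holds every kind of genetic feature, and
# parent_id / rel_start express where a feature sits relative to another.
SCHEMA = """
CREATE TABLE feature (
    feature_id   INTEGER PRIMARY KEY,
    feature_type TEXT NOT NULL,      -- e.g. 'gene', 'gene region', 'polymorphism'
    name         TEXT NOT NULL,
    parent_id    INTEGER REFERENCES feature(feature_id),
    rel_start    INTEGER,            -- offset from the start of the parent feature
    length       INTEGER
);
"""


def open_repository() -> sqlite3.Connection:
    """Create an empty in-memory repository with the single feature table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_feature(conn, feature_type, name, parent_id=None, rel_start=None, length=None):
    """Store a feature, optionally positioned on a parent feature; return its id."""
    cur = conn.execute(
        "INSERT INTO feature (feature_type, name, parent_id, rel_start, length) "
        "VALUES (?, ?, ?, ?, ?)",
        (feature_type, name, parent_id, rel_start, length),
    )
    return cur.lastrowid


def features_within(conn, parent_id):
    """Names of the features positioned on a parent, in positional order."""
    rows = conn.execute(
        "SELECT name FROM feature WHERE parent_id = ? ORDER BY rel_start",
        (parent_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping all feature types in one table means a query by relative position, such as `features_within`, does not need to know in advance which kinds of features it will return.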

176. A method for storing and organizing biological information, which
comprises

(a) providing a data structure comprising a single parent table
which is adapted for storing, organizing, and retrieving a
plurality of genetic features by the relative positional
relationships between the genetic features; and

(b) positioning a first genetic feature onto a second genetic feature.

177. The method of claim 175, wherein said first genetic feature is an
assembly and said second genetic feature is a gene.

178. The method of claim 177, further comprising positioning a third genetic
feature onto said gene.

179. The method of claim 178, wherein said third genetic feature is a gene
region and the method further comprises positioning onto said gene region a
polymorphism.

180. The method of claim 179, further comprising providing a relationship
between the polymorphism and at least one phenotype which is associated with the
polymorphism.

181. The method of claim 177, further comprising positioning onto said gene
a haplotype which comprises a plurality of polymorphisms.

182. The method of claim 178, further comprising providing a relationship
between the haplotype and at least one phenotype which is associated with the
haplotype.
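The positioning steps recited in the method claims above can be pictured, under stated assumptions, as recording for each feature the feature it is positioned onto and an offset within that feature. The `Feature` class, the `position` and `offset_within` functions, and the use of integer offsets are hypothetical and are offered only as a sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    kind: str                          # e.g. 'gene', 'gene region', 'polymorphism'
    name: str
    onto: Optional["Feature"] = None   # the feature this one is positioned onto
    offset: int = 0                    # position relative to the start of `onto`


def position(feature: Feature, onto: Feature, offset: int) -> Feature:
    """Position `feature` onto `onto` at the given relative offset."""
    feature.onto = onto
    feature.offset = offset
    return feature


def offset_within(feature: Feature, ancestor: Feature) -> int:
    """Offset of `feature` relative to `ancestor`, summed along the chain."""
    total = 0
    node = feature
    while node is not ancestor:
        if node.onto is None:
            raise ValueError(f"{feature.name} is not positioned on {ancestor.name}")
        total += node.offset
        node = node.onto
    return total
```

For example, a polymorphism positioned at offset 17 on a gene region that itself sits at offset 100 on a gene resolves to offset 117 within the gene, so only relative positions need to be stored.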

183. A data structure for storing and organizing biological information,
stored on a computer-readable medium and accessible by a processor, which
comprises at least two different fields, one of which includes a plurality of genetic
features, and the other of which includes relative positional relationships between
the genetic features.

iCoGen CTS Modeler: Test

Legend of Figures:

Rectangle Boxes: Tables in the database.

Rounded Boxes: Children tables that depend on their parent tables.
This dependency requires a parent record to be in existence
before a child record can be created.

Identifying parent / child relationship. It depicts the not nullable
1-to-0-or-many relationship.

Non-identifying parent / child relationship. It represents the nullable
0-or-1-to-many relationship.

Identifying parent / child relationship. It depicts the not nullable
1-to-1-or-many relationship.

Non-identifying parent / child relationship. It represents the not
nullable 1-to-1-or-many relationship.

Identifying parent / child relationship. It depicts the not nullable
1-to-exact-1 relationship.

Non-identifying parent / child relationship. It represents the nullable
0-or-1-to-exact-1 relationship.

Non-identifying parent / child relationship. It represents the not
nullable 0-or-1-to-many relationship.

FIG. 25F